
On this page

  • Abstract
  • 1. Introduction
  • 2. From Association to Causation
    • 2.1 Understanding the distinction and its implications
    • 2.2 Untested assumptions and new notation
  • 3. Structural Models, Diagrams, Causal Effects, and Counterfactuals
    • 3.1 A brief introduction to structural equation models
    • 3.2 From linear to nonparametric models and graphs
      • 3.2.1 Representing interventions
      • 3.2.2 Estimating the effect of interventions
      • 3.2.3 Causal effects from data and graphs
    • 3.3 Coping with unmeasured confounders
      • 3.3.1 Covariate selection – the back-door criterion
      • 3.3.2 Confounding equivalence – a graphical test
      • 3.3.3 General control of confounding
      • 3.3.4 From identification to estimation
    • 3.4 Counterfactual analysis in structural models
  • 4. Methodological Principles of Causal Inference
    • 4.1 Defining the target quantity
    • 4.2 Explicating causal assumptions
    • 4.3 Identification, estimation, and approximation
  • 5. The Potential Outcome Framework
    • 5.1 The “black-box” missing-data paradigm
    • 5.2 Problem formulation and the demystification of “ignorability”
    • 5.3 Combining graphs and potential outcomes
  • 6. Counterfactuals at Work
    • 6.1 Mediation: Direct and indirect effects
      • 6.1.1 Direct versus total effects
      • 6.1.2 Controlled direct-effects
      • 6.1.3 Natural direct effects
      • 6.1.4 Natural indirect effects
    • 6.2 The Mediation Formula: a simple solution to a thorny problem
    • 6.3 Causes of effects and probabilities of causation
  • 7. Conclusions
  • References

An Introduction to Causal Inference

Author

Judea Pearl

Published

November 2010

The International Journal of Biostatistics

Volume 6, Issue 2, 2010, Article 7


Recommended Citation:

Pearl, Judea (2010) “An Introduction to Causal Inference,” The International Journal of Biostatistics: Vol. 6: Iss. 2, Article 7.

Abstract

This paper summarizes recent advances in causal inference and underscores the paradigmatic shifts that must be undertaken in moving from traditional statistical analysis to causal analysis of multivariate data. Special emphasis is placed on the assumptions that underlie all causal inferences, the languages used in formulating those assumptions, the conditional nature of all causal and counterfactual claims, and the methods that have been developed for the assessment of such claims. These advances are illustrated using a general theory of causation based on the Structural Causal Model (SCM) described in Pearl (2000a), which subsumes and unifies other approaches to causation, and provides a coherent mathematical foundation for the analysis of causes and counterfactuals. In particular, the paper surveys the development of mathematical tools for inferring (from a combination of data and assumptions) answers to three types of causal queries: those about (1) the effects of potential interventions, (2) probabilities of counterfactuals, and (3) direct and indirect effects (also known as “mediation”). Finally, the paper defines the formal and conceptual relationships between the structural and potential-outcome frameworks and presents tools for a symbiotic analysis that uses the strong features of both. The tools are demonstrated in the analyses of mediation, causes of effects, and probabilities of causation.

KEYWORDS: structural equation models, confounding, graphical methods, counterfactuals, causal effects, potential-outcome, mediation, policy evaluation, causes of effects

Author Notes: Portions of this paper are adapted from Pearl (2000a, 2009a,b); I am indebted to Elja Arjas, Sander Greenland, David MacKinnon, Patrick Shrout, and many readers of the UCLA Causality Blog (http://www.mii.ucla.edu/causality/) for reading and commenting on various segments of this manuscript, and especially to Erica Moodie and David Stephens for their thorough editorial input. This research was supported in parts by NIH grant #1R01LM009961-01, NSF grant #IIS-0914211, and ONR grant #N000-14-09-1-0665.

1. Introduction

Most studies in the health, social and behavioral sciences aim to answer causal, rather than associative, questions. Such questions require some knowledge of the data-generating process, and cannot be computed from the data alone, nor from the distributions that govern the data. Remarkably, although much of the conceptual framework and algorithmic tools needed for tackling such problems are now well established, they are not known to many of the researchers who could put them into practical use. Solving causal problems systematically requires certain extensions in the standard mathematical language of statistics, and these extensions are not typically emphasized in the mainstream literature. As a result, many statistical researchers have not yet benefited from causal inference results in (i) counterfactual analysis, (ii) nonparametric structural equations, (iii) graphical models, and (iv) the symbiosis between counterfactual and graphical methods. This survey aims at making these contemporary advances more accessible by providing a gentle introduction to causal inference and its methodological principles; for a more in-depth treatment, see (Pearl, 2000a, 2009a,b).

In Section 2, we discuss coping with untested assumptions and the new mathematical notation that is required to move from associational to causal statistics.

Section 3.1 introduces the fundamentals of the structural theory of causation and uses these modeling fundamentals to represent interventions and develop mathematical tools for estimating causal effects (Section 3.3) and counterfactual quantities (Section 3.4). Section 4 outlines a general methodology to guide problems of causal inference: Define, Assume, Identify and Estimate, with each step benefiting from the tools developed in Section 3.

Section 5 relates these tools to those used in the potential-outcome framework, and offers a formal mapping between the two frameworks and a symbiosis (Section 5.3) that exploits the best features of both. Finally, the benefit of this symbiosis is demonstrated in Section 6, in which the structure-based logic of counterfactuals is harnessed to estimate causal quantities that cannot be defined within the paradigm of controlled randomized experiments. These include direct and indirect effects, the effect of treatment on the treated, and questions of attribution, i.e., whether one event can be deemed “responsible” for another.

2. From Association to Causation

2.1 Understanding the distinction and its implications

The aim of standard statistical analysis is to assess parameters of a distribution from samples drawn of that distribution. With the help of such parameters, associations among variables can be inferred, which permits the researcher to estimate probabilities of past and future events and update those probabilities in light of new information. These tasks are managed well by standard statistical analysis so long as experimental conditions remain the same. Causal analysis goes one step further; its aim is to infer probabilities under conditions that are changing, for example, changes induced by treatments or external interventions.

This distinction implies that causal and associational concepts do not mix; there is nothing in a distribution function to tell us how that distribution would differ if external conditions were to change, say from an observational to an experimental setup, because the laws of probability theory do not dictate how one property of a distribution ought to change when another property is modified. This information must be provided by causal assumptions which identify relationships that remain invariant when external conditions change.

A useful demarcation line between associational and causal concepts, one that is crisp and easy to apply, can be formulated as follows. An associational concept is any relationship that can be defined in terms of a joint distribution of observed variables, and a causal concept is any relationship that cannot be defined from the distribution alone. Examples of associational concepts are: correlation, regression, dependence, conditional independence, likelihood, collapsibility, propensity score, risk ratio, odds ratio, marginalization, conditionalization, “controlling for,” and many more. Examples of causal concepts are: randomization, influence, effect, confounding, “holding constant,” disturbance, error terms, structural coefficients, spurious correlation, faithfulness/stability, instrumental variables, intervention, explanation, and attribution. The former can, while the latter cannot, be defined in terms of distribution functions.

This demarcation line is extremely useful in tracing the assumptions that are needed for substantiating various types of scientific claims. Every claim invoking causal concepts must rely on some premises that invoke such concepts; it cannot be inferred from, or even defined in terms of, statistical associations alone.

This distinction further implies that causal relations cannot be expressed in the language of probability and, hence, that any mathematical approach to causal analysis must acquire new notation – probability calculus is insufficient. To illustrate, the syntax of probability calculus does not permit us to express the simple fact that “symptoms do not cause diseases,” let alone draw mathematical conclusions from such facts. All we can say is that two events are dependent, meaning that if we find one, we can expect to encounter the other, but we cannot distinguish statistical dependence, quantified by the conditional probability P(disease|symptom), from causal dependence, for which we have no expression in standard probability calculus.

2.2 Untested assumptions and new notation

The preceding two requirements: (1) to commence causal analysis with untested,[1] theoretically or judgmentally based assumptions, and (2) to extend the syntax of probability calculus, constitute the two primary barriers to the acceptance of causal analysis among professionals with traditional training in statistics.

Associational assumptions, even untested, are testable in principle, given a sufficiently large sample and sufficiently fine measurements. Causal assumptions, in contrast, cannot be verified even in principle, unless one resorts to experimental control. This difference stands out in Bayesian analysis. Though the priors that Bayesians commonly assign to statistical parameters are untested quantities, the sensitivity to these priors tends to diminish with increasing sample size. In contrast, sensitivity to prior causal assumptions, say that treatment does not change gender, remains substantial regardless of sample size.

This makes it doubly important that the notation we use for expressing causal assumptions be cognitively meaningful and unambiguous so that one can clearly judge the plausibility or inevitability of the assumptions articulated. Statisticians can no longer ignore the mental representation in which scientists store experiential knowledge, since it is this representation, and the language used to access it, that determine the reliability of the judgments upon which the analysis so crucially depends.

Those versed in the potential-outcome notation (Neyman, 1923, Rubin, 1974, Holland, 1988) can recognize causal expressions through the subscripts that are attached to counterfactual events and variables, e.g., \(Y_x(u)\) or \(Z_{xy}\). (Some authors use parenthetical expressions, e.g., \(Y(0)\), \(Y(1)\), \(Y(x,u)\) or \(Z(x,y)\).) The expression \(Y_x(u)\), for example, stands for the value that outcome Y would take in individual u, had treatment X been at level x. If u is chosen at random, \(Y_x\) is a random variable, and one can talk about the probability that \(Y_x\) would attain a value y in the population, written \(P(Y_x = y)\) (see Section 5 for semantics). Alternatively, Pearl (1995) used expressions of the form \(P(Y = y|set(X = x))\) or \(P(Y = y|do(X = x))\) to denote the probability (or frequency) that event \((Y = y)\) would occur if treatment condition X = x were enforced uniformly over the population.(2) Still a third notation that distinguishes causal expressions is provided by graphical models, where the arrows convey causal directionality.

1: By “untested” I mean untested using frequency data in nonexperimental studies.

However, few have taken seriously the textbook requirement that any introduction of new notation must entail a systematic definition of the syntax and semantics that governs the notation. Moreover, in the bulk of the statistical literature before 2000, causal claims rarely appear in the mathematics. They surface only in the verbal interpretation that investigators occasionally attach to certain associations, and in the verbal description with which investigators justify assumptions.

For example, the assumption that a covariate not be affected by a treatment, a necessary assumption for the control of confounding (Cox, 1958, p. 48), is expressed in plain English, not in a mathematical expression.

The next section provides a conceptualization that overcomes these mental barriers by offering a friendly mathematical machinery for cause-effect analysis and a formal foundation for counterfactual analysis.

3. Structural Models, Diagrams, Causal Effects, and Counterfactuals

Any conception of causation worthy of the title “theory” must be able to (1) represent causal questions in some mathematical language, (2) provide a precise language for communicating assumptions under which the questions need to be answered, (3) provide a systematic way of answering at least some of these questions and labeling others “unanswerable,” and (4) provide a method of determining what assumptions or new measurements would be needed to answer the “unanswerable” questions.

A “general theory” should do more. In addition to embracing all questions judged to have causal character, a general theory must also subsume any other theory or method that scientists have found useful in exploring the various aspects of causation. In other words, any alternative theory needs to evolve as a special case of the “general theory” when restrictions are imposed on either the model, the type of assumptions admitted, or the language in which those assumptions are cast.

The structural theory that we use in this survey satisfies the criteria above. It is based on the Structural Causal Model (SCM) developed in (Pearl, 1995, 2000a), which combines features of the structural equation models (SEM) used in economics and social science (Goldberger, 1973, Duncan, 1975), the potential-outcome framework of Neyman (1923) and Rubin (1974), and the graphical models developed for probabilistic reasoning and causal analysis (Pearl, 1988, Lauritzen, 1996, Spirtes, Glymour, and Scheines, 2000, Pearl, 2000a).

2: Clearly, \(P(Y = y|do(X = x))\) is equivalent to \(P(Y_x = y)\). This is what we normally assess in a controlled experiment, with X randomized, in which the distribution of Y is estimated for each level x of X.

Although the basic elements of SCM were introduced in the mid 1990’s (Pearl, 1995), and have been adapted widely by epidemiologists (Greenland, Pearl, and Robins, 1999, Glymour and Greenland, 2008), statisticians (Cox and Wermuth, 2004, Lauritzen, 2001), and social scientists (Morgan and Winship, 2007), its potentials as a comprehensive theory of causation are yet to be fully utilized. Its ramifications thus far include:

  1. The unification of the graphical, potential outcome, structural equations, decision analytical (Dawid, 2002), interventional (Woodward, 2003), sufficient component (Rothman, 1976) and probabilistic (Suppes, 1970) approaches to causation, with each approach viewed as a restricted version of the SCM.
  2. The definition, axiomatization and algorithmization of counterfactuals and joint probabilities of counterfactuals.
  3. Reducing the evaluation of “effects of causes,” “mediated effects,” and “causes of effects” to an algorithmic level of analysis.
  4. Solidifying the mathematical foundations of the potential-outcome model, and formulating the counterfactual foundations of structural equation models.
  5. Demystifying enigmatic notions such as “confounding,” “mediation,” “ignorability,” “comparability,” “exchangeability (of populations),” “superexogeneity” and others within a single and familiar conceptual framework.
  6. Weeding out myths and misconceptions from outdated traditions (Meek and Glymour, 1994, Greenland et al., 1999, Cole and Hernán, 2002, Arah, 2008, Shrier, 2009, Pearl, 2009c).

This section provides a gentle introduction to the structural framework and uses it to present the main advances in causal inference that have emerged in the past two decades.

3.1 A brief introduction to structural equation models

How can one express mathematically the common understanding that symptoms do not cause diseases? The earliest attempt to formulate such relationships mathematically was made in the 1920’s by the geneticist Sewall Wright (1921). Wright used a combination of equations and graphs to communicate causal relationships. For example, if X stands for a disease variable and Y stands for a certain symptom of the disease,(3) Wright would write a linear equation:

\[y = \beta x + u_Y \tag{1}\]

where x stands for the level (or severity) of the disease, y stands for the level (or severity) of the symptom, and \(u_Y\) stands for all factors, other than the disease in question, that could possibly affect Y when X is held constant. In interpreting this equation one should think of a physical process whereby Nature examines the values of x and \(u_Y\) and, accordingly, assigns variable Y the value \(y = \beta x + u_Y\). Similarly, to “explain” the occurrence of disease X, one could write \(x = u_X\), where \(U_X\) stands for all factors affecting X.

Equation (1) still does not properly express the causal relationship implied by this assignment process, because algebraic equations are symmetrical objects; if we re-write (1) as

\[x = (y - u_Y)/\beta \tag{2}\]

it might be misinterpreted to mean that the symptom influences the disease. To express the directionality of the underlying process, Wright augmented the equation with a diagram, later called a “path diagram,” in which arrows are drawn from (perceived) causes to their (perceived) effects, and, more importantly, the absence of an arrow makes the empirical claim that Nature assigns values to one variable irrespective of another. In Fig. 1, for example, the absence of an arrow from Y to X represents the claim that symptom Y is not among the factors \(U_X\) which affect disease X. Thus, in our example, the complete model of a symptom and a disease would be written as in Fig. 1: the diagram encodes the possible existence of (direct) causal influence of X on Y, and the absence of causal influence of Y on X, while the equations encode the quantitative relationships among the variables involved, to be determined from the data. The parameter β in the equation is called a “path coefficient” and it quantifies the (direct) causal effect of X on Y; given the numerical values of β and \(U_Y\), the equation claims that a unit increase in X would result in β units increase of Y regardless of the values taken by other variables in the model, and regardless of whether the increase in X originates from external or internal influences.

The variables \(U_X\) and \(U_Y\) are called “exogenous;” they represent observed or unobserved background factors that the modeler decides to keep unexplained, that is, factors that influence but are not influenced by the other variables (called “endogenous”) in the model. Unobserved exogenous variables are sometimes called “disturbances” or “errors”; they represent factors omitted from the model but judged to be relevant for explaining the behavior of variables in the model. Variable \(U_X\), for example, represents factors that contribute to the disease X, which may or may not be correlated with \(U_Y\) (the factors that influence the symptom Y). Thus, background factors in structural equations differ fundamentally from residual terms in regression equations. The latter are artifacts of analysis which, by definition, are uncorrelated with the regressors. The former are part of physical reality (e.g., genetic factors, socio-economic conditions) which are responsible for variations observed in the data; they are treated as any other variable, though we often cannot measure their values precisely and must resign ourselves to merely acknowledging their existence and assessing qualitatively how they relate to other variables in the system.

3: Linear relations are used here for illustration purposes only; they do not represent typical disease-symptom relations but illustrate the historical development of path analysis. Additionally, we will use standardized variables, that is, zero mean and unit variance.

If correlation is presumed possible, it is customary to connect the two variables, \(U_X\) and \(U_Y\), by a dashed double arrow, as shown in Fig. 1(b).

Figure 1: A simple structural equation model, and its associated diagrams. Unobserved exogenous variables are connected by dashed arrows.

In reading path diagrams, it is common to use kinship relations such as parent, child, ancestor, and descendant, the interpretation of which is usually self-evident. For example, an arrow X → Y designates X as a parent of Y and Y as a child of X. A “path” is any consecutive sequence of edges, solid or dashed. For example, there are two paths between X and Y in Fig. 1(b), one consisting of the direct arrow X → Y while the other tracing the nodes X, \(U_X\), \(U_Y\), and Y.

Wright’s major contribution to causal analysis, aside from introducing the language of path diagrams, has been the development of graphical rules for writing down the covariance of any pair of observed variables in terms of path coefficients and of covariances among the error terms. In our simple example, one can immediately write the relations

\[Cov(X,Y) = \beta \tag{3}\]

for Fig. 1(a), and

\[Cov(X,Y) = \beta + Cov(U_Y, U_X) \tag{4}\]

for Fig. 1(b). (These can be derived of course from the equations, but, for large models, algebraic methods tend to obscure the origin of the derived quantities.)

Under certain conditions (e.g., if \(Cov(U_Y, U_X) = 0\)), such relationships may allow one to solve for the path coefficients in terms of observed covariance terms only, and this amounts to inferring the magnitude of (direct) causal effects from observed, nonexperimental associations, assuming of course that one is prepared to defend the causal assumptions encoded in the diagram.
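To make Eq. (3) concrete, here is a minimal simulation sketch (not part of the original text) of the model in Fig. 1(a), assuming standardized variables and \(Cov(U_X, U_Y) = 0\); under these assumptions the observed covariance of X and Y recovers the path coefficient β.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 0.7

# Fig. 1(a): x = u_X, y = beta*x + u_Y, with Cov(U_X, U_Y) = 0 and standardized variables.
u_x = rng.standard_normal(n)
u_y = np.sqrt(1 - beta**2) * rng.standard_normal(n)   # keeps Var(Y) = 1
x = u_x
y = beta * x + u_y

# Eq. (3): Cov(X, Y) = beta, so a plain covariance estimates the direct causal effect.
print(np.cov(x, y)[0, 1])   # approximately 0.7
```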

It is important to note that, in path diagrams, causal assumptions are encoded not in the links but, rather, in the missing links. An arrow merely indicates the possibility of causal connection, the strength of which remains to be determined (from data); a missing arrow represents a claim of zero influence, while a missing double arrow represents a claim of zero covariance. In Fig. 1(a), for example, the assumptions that permit us to identify the direct effect β are encoded by the missing double arrow between \(U_X\) and \(U_Y\), indicating \(Cov(U_Y, U_X) = 0\), together with the missing arrow from Y to X. Had either of these two links been added to the diagram, we would not have been able to identify the direct effect β. Such additions would amount to relaxing the assumption \(Cov(U_Y, U_X) = 0\), or the assumption that Y does not affect X, respectively. Note also that both assumptions are causal, not associational, since neither can be determined from the joint density of the observed variables, X and Y; the association between the unobserved terms, \(U_Y\) and \(U_X\), can only be uncovered in an experimental setting, or (in more intricate models, as in Fig. 5) from other causal assumptions.

Although each causal assumption in isolation cannot be tested, the sum total of all causal assumptions in a model often has testable implications. The chain model of Fig. 2(a), for example, encodes seven causal assumptions, each corresponding to a missing arrow or a missing double-arrow between a pair of variables.

None of those assumptions is testable in isolation, yet the totality of all those assumptions implies that Z is unassociated with Y in every stratum of X. Such testable implications can be read off the diagrams using a graphical criterion known as d-separation (Pearl, 1988).

Definition 1 (d-separation) A set S of nodes is said to block a path p if either (i) p contains at least one arrow-emitting node that is in S, or (ii) p contains at least one collision node that is outside S and has no descendant in S. If S blocks all paths from X to Y, it is said to “d-separate X and Y,” and then X and Y are independent given S, written \(X \perp\!\!\!\perp Y | S\).

To illustrate, the path \(U_Z \rightarrow Z \rightarrow X \rightarrow Y\) is blocked by S = {Z} and by S = {X}, since each emits an arrow along that path. Consequently, we can infer that the conditional independencies \(U_Z \perp\!\!\!\perp Y | Z\) and \(U_Z \perp\!\!\!\perp Y | X\) will be satisfied in any probability function that this model can generate, regardless of how we parametrize the arrows. Likewise, the path \(U_Z \rightarrow Z \rightarrow X \leftarrow U_X\) is blocked by the null set \(\{\emptyset\}\) but is not blocked by S = {Y}, since Y is a descendant of the collision node X.

Consequently, the marginal independence \(U_Z \perp\!\!\!\perp U_X\) will hold in the distribution, but \(U_Z \perp\!\!\!\perp U_X | Y\) may or may not hold. This special handling of collision nodes (or colliders, e.g., \(Z \rightarrow X \leftarrow U_X\)) reflects a general phenomenon known as Berkson’s paradox (Berkson, 1946), whereby observations on a common consequence of two independent causes render those causes dependent. For example, the outcomes of two independent coins are rendered dependent by the testimony that at least one of them is a tail.
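The two-coin illustration is easy to verify numerically; the short sketch below (mine, not the paper’s) shows the coins are uncorrelated marginally but become negatively correlated once we condition on the collider event “at least one tail.”

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
coin1 = rng.integers(0, 2, n)   # 1 = tail, 0 = head
coin2 = rng.integers(0, 2, n)

# Marginally the two coins are independent:
print(np.corrcoef(coin1, coin2)[0, 1])                    # approximately 0.0

# Condition on the testimony "at least one coin is a tail" (a common consequence):
report = (coin1 == 1) | (coin2 == 1)
print(np.corrcoef(coin1[report], coin2[report])[0, 1])    # approximately -0.5
```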

The conditional independencies entailed by d-separation constitute the main opening through which the assumptions embodied in structural equation models can confront the scrutiny of nonexperimental data. In other words, almost all statistical tests capable of invalidating the model are entailed by those implications.(4)

Figure 2: (a) The diagram associated with the structural model of Eq. (5). (b) The diagram associated with the modified model of Eq. (6), representing the intervention do(X = x0).

3.2 From linear to nonparametric models and graphs

Structural equation modeling (SEM) has been the main vehicle for effect analysis in economics and the behavioral and social sciences (Goldberger, 1972, Duncan, 1975, Bollen, 1989). However, the bulk of SEM methodology was developed for linear analysis and, until recently, no comparable methodology has been devised to extend its capabilities to models involving dichotomous variables or nonlinear dependencies. A central requirement for any such extension is to detach the notion of “effect” from its algebraic representation as a coefficient in an equation, and redefine “effect” as a general capacity to transmit changes among variables. Such an extension, based on simulating hypothetical interventions in the model, was proposed in (Haavelmo, 1943, Strotz and Wold, 1960, Spirtes, Glymour, and Scheines, 1993, Pearl, 1993a, 2000a, Lindley, 2002) and has led to new ways of defining and estimating causal effects in nonlinear and nonparametric models (that is, models in which the functional form of the equations is unknown).

4: Additional implications called “dormant independence” (Shpitser and Pearl, 2008) may be deduced from some graphs with correlated errors (Verma and Pearl, 1990).

The central idea is to exploit the invariant characteristics of structural equations without committing to a specific functional form. For example, the nonparametric interpretation of the diagram of Fig. 2(a) corresponds to a set of three functions, each corresponding to one of the observed variables:

\[z = f_Z(u_Z), \quad x = f_X(z, u_X), \quad y = f_Y(x, u_Y) \tag{5}\]

where in this particular example \(U_Z\), \(U_X\) and \(U_Y\) are assumed to be jointly independent but otherwise arbitrarily distributed. Each of these functions represents a causal process (or mechanism) that determines the value of the left variable (output) from those on the right variables (inputs). The absence of a variable from the right-hand side of an equation encodes the assumption that Nature ignores that variable in the process of determining the value of the output variable. For example, the absence of variable Z from the arguments of \(f_Y\) conveys the empirical claim that variations in Z will leave Y unchanged, as long as variables \(U_Y\) and X remain constant. A system of such functions is said to be structural if they are assumed to be autonomous, that is, each function is invariant to possible changes in the form of the other functions (Simon, 1953, Koopmans, 1953).
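To show how the three functions in Eq. (5) can be manipulated as computational objects, the sketch below (my own; the particular choices of f_Z, f_X, f_Y are arbitrary stand-ins for the unknown mechanisms) encodes the chain model of Fig. 2(a) and draws observational samples from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary stand-ins for the unknown mechanisms in Eq. (5); any functions would do.
def f_Z(u_z):
    return u_z

def f_X(z, u_x):
    return (z + u_x > 0).astype(int)            # x depends only on z and u_X

def f_Y(x, u_y):
    return (0.8 * x + u_y > 0.5).astype(int)    # y depends only on x and u_Y

def sample(n):
    """Observational samples from the model of Fig. 2(a)."""
    u_z, u_x, u_y = (rng.standard_normal(n) for _ in range(3))  # jointly independent
    z = f_Z(u_z)
    x = f_X(z, u_x)
    y = f_Y(x, u_y)
    return z, x, y

z, x, y = sample(100_000)
```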

3.2.1 Representing interventions

This feature of invariance permits us to use structural equations as a basis for modeling causal effects and counterfactuals. This is done through a mathematical operator called do(x), which simulates physical interventions by deleting certain functions from the model, replacing them by a constant X = x, while keeping the rest of the model unchanged. For example, to emulate an intervention do(x0) that holds X constant (at X = x0) in model M of Fig. 2(a), we replace the equation for x in Eq. (5) with x = x0, and obtain a new model, \(M_{x_0}\),

\[z = f_Z(u_Z), \quad x = x_0, \quad y = f_Y(x, u_Y) \tag{6}\]

the graphical description of which is shown in Fig. 2(b).

The joint distribution associated with the modified model, denoted \(P(z,y|do(x_0))\), describes the post-intervention distribution of variables Y and Z (also called the “controlled” or “experimental” distribution), to be distinguished from the pre-intervention distribution, \(P(x,y,z)\), associated with the original model of Eq. (5).

For example, if X represents a treatment variable, Y a response variable, and Z some covariate that affects the amount of treatment received, then the distribution \(P(z,y|do(x_0))\) gives the proportion of individuals that would attain response level Y = y and covariate level Z = z under the hypothetical situation in which treatment X = x0 is administered uniformly to the population.
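Continuing the sketch started in Section 3.2 (my illustration, using the same hypothetical functions), the do(x0) operator can be emulated by replacing f_X with the constant x0 while leaving f_Z and f_Y untouched; samples from the mutilated model of Eq. (6) then approximate the post-intervention distribution P(z, y | do(x0)). For this particular chain model the interventional and observational conditional distributions coincide, as Section 3.2.2 derives.

```python
def sample_do(n, x0):
    """Samples from the mutilated model M_{x0} of Eq. (6)."""
    u_z, u_y = rng.standard_normal(n), rng.standard_normal(n)
    z = f_Z(u_z)
    x_fixed = np.full(n, x0)   # surgery: the equation for X is replaced by X = x0
    y = f_Y(x_fixed, u_y)
    return z, x_fixed, y

_, _, y_do = sample_do(100_000, 1)
print(y_do.mean())        # P(Y = 1 | do(X = 1)), by simulation
print(y[x == 1].mean())   # P(Y = 1 | X = 1); equal here, as shown in Eqs. (11)-(13)
```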

In general, we can formally define the post-intervention distribution by the equation:

\[P_M(y|do(x)) \triangleq P_{M_{x}}(y) \tag{7}\]

In words: in the framework of model M, the post-intervention distribution of outcome Y is defined as the probability that the submodel \(M_x\) assigns to each outcome level Y = y.

From this distribution, one is able to assess treatment efficacy by comparing aspects of this distribution at different levels of x0. A common measure of treatment efficacy is the average difference

\[E(Y|do(x'_0)) - E(Y|do(x_0)) \tag{8}\]

where \(x'_0\) and \(x_0\) are two levels (or types) of treatment selected for comparison.

Another measure is the experimental Risk Ratio

\[E(Y|do(x'_0))\, /\, E(Y|do(x_0)) \tag{9}\]

The variance \(Var(Y|do(x_0))\), or any other distributional parameter, may also enter the comparison; all these measures can be obtained from the controlled distribution function \(P(Y = y|do(x)) = \sum_{z} P(z,y|do(x))\), which was called the “causal effect” in Pearl (2000a, 1995) (see footnote 2). The central question in the analysis of causal effects is the question of identification: Can the controlled (post-intervention) distribution, \(P(Y = y|do(x))\), be estimated from data governed by the pre-intervention distribution, \(P(z,x,y)\)?

The problem of identification has received considerable attention in econometrics (Hurwicz, 1950, Marschak, 1950, Koopmans, 1953) and social science (Duncan, 1975, Bollen, 1989), usually in linear parametric settings, where it reduces to asking whether some model parameter, β, has a unique solution in terms of the parameters of P (the distribution of the observed variables). In the nonparametric formulation, identification is more involved, since the notion of “has a unique solution” does not directly apply to causal quantities such as Q(M) = P(y|do(x)), which have no distinct parametric signature and are defined procedurally by simulating an intervention in a causal model M (as in (6)). The following definition overcomes these difficulties:

Definition 2 (Identifiability (Pearl, 2000a, p. 77)) A quantity Q(M) is identifiable, given a set of assumptions A, if for any two models \(M_1\) and \(M_2\) that satisfy A, we have

\[P(M_1) = P(M_2) \Rightarrow Q(M_1) = Q(M_2) \tag{10}\]

In words, the details of \(M_1\) and \(M_2\) do not matter; what matters is that the assumptions in A (e.g., those encoded in the diagram) would constrain the variability of those details in such a way that equality of P’s would entail equality of Q’s.

When this happens, Q depends on P only, and should therefore be expressible in terms of the parameters of P. The next subsections exemplify and operationalize this notion.

3.2.2 Estimating the effect of interventions

To understand how hypothetical quantities such as \(P(y|do(x))\) or \(E(Y|do(x_0))\) can be estimated from actual data and a partially specified model, let us begin with a simple demonstration on the model of Fig. 2(a). We will see that, despite our ignorance of \(f_X\), \(f_Y\), \(f_Z\) and \(P(u)\), \(E(Y|do(x_0))\) is nevertheless identifiable and is given by the conditional expectation \(E(Y|X = x_0)\). We do this by deriving and comparing the expressions for these two quantities, as defined by (5) and (6), respectively. The mutilated model in Eq. (6) dictates:

\[E(Y|do(x_0)) = E(f_Y(x_0, u_Y)), \tag{11}\]

whereas the pre-intervention model of Eq. (5) gives

\[E(Y|X = x_0) = E(f_Y(X, u_Y)|X = x_0) = E(f_Y(x_0, u_Y)|X = x_0) = E(f_Y(x_0, u_Y)) \tag{12}\]

which is identical to (11). Therefore,

\[E(Y|do(x_0)) = E(Y|X = x_0) \tag{13}\]

Using a similar derivation, though somewhat more involved, we can show that \(P(y|do(x))\) is identifiable and given by the conditional probability \(P(y|x)\).

We see that the derivation of (13) was enabled by two assumptions; first, Y is a function of X and \(U_Y\) only, and, second, \(U_Y\) is independent of \(\{U_Z, U_X\}\), hence of X. The latter assumption parallels the celebrated “orthogonality” condition in linear models, \(Cov(X, U_Y) = 0\), which has been used routinely, often thoughtlessly, to justify the estimation of structural coefficients by regression techniques.

Naturally, if we were to apply this derivation to the linear models of Fig. 1(a) or 1(b), we would get the expected dependence between Y and the intervention do(x0):

\[E(Y|do(x_0)) = E(f_Y(x_0, u_Y)) = E(\beta x_0 + u_Y) = \beta x_0 \tag{14}\]

This equality endows β with its causal meaning as “effect coefficient.” It is extremely important to keep in mind that in structural (as opposed to regressional) models, β is not “interpreted” as an effect coefficient but is “proven” to be one by the derivation above. β will retain this causal interpretation regardless of how X is actually selected (through the function \(f_X\), Fig. 2(a)) and regardless of whether \(U_X\) and \(U_Y\) are correlated (as in Fig. 1(b)) or uncorrelated (as in Fig. 1(a)). Correlations may only impede our ability to estimate β from nonexperimental data, but will not change its definition as given in (14). Accordingly, and contrary to endless confusions in the literature (see footnote 12), structural equations say absolutely nothing about the conditional expectation \(E(Y|X = x)\). Such a connection may exist under special circumstances, e.g., if \(Cov(X, U_Y) = 0\), as in Eq. (13), but is otherwise irrelevant to the definition or interpretation of β as effect coefficient, or to the empirical claims of Eq. (1).

The next subsection will circumvent these derivations altogether by reducing the identification problem to a graphical procedure. Indeed, since graphs encode all the information that non-parametric structural equations represent, they should permit us to solve the identification problem without resorting to algebraic analysis.

3.2.3 Causal effects from data and graphs

Causal analysis in graphical models begins with the realization that all causal effects are identifiable whenever the model is Markovian, that is, the graph is acyclic (i.e., contains no directed cycles) and all the error terms are jointly independent.

Non-Markovian models, such as those involving correlated errors (resulting from unmeasured confounders), permit identification only under certain conditions, and these conditions too can be determined from the graph structure (Section 3.3). The key to these results rests with the following basic theorem.

Theorem 1 (The Causal Markov Condition) Any distribution generated by a Markovian model M can be factorized as:

\[P(v_1, v_2, \ldots, v_n) = \prod_i P(v_i|pa_i) \tag{15}\]

where \(V_1, V_2, \ldots, V_n\) are the endogenous variables in M, and \(pa_i\) are (values of) the endogenous “parents” of \(V_i\) in the causal diagram associated with M.

For example, the distribution associated with the model in Fig. 2(a) can be factorized as

\[P(z, y, x) = P(z)\, P(x|z)\, P(y|x) \tag{16}\]

since X is the (endogenous) parent of Y, Z is the parent of X, and Z has no parents.

Corollary 1 (Truncated factorization) For any Markovian model, the distribution generated by an intervention do(X = x0) on a set X of endogenous variables is given by the truncated factorization

\[P(v_1, v_2, \ldots, v_k|do(x_0)) = \prod_{i|V_i \notin X} P(v_i|pa_i)\Big|_{x = x_0} \tag{17}\]

where \(P(v_i|pa_i)\) are the pre-intervention conditional probabilities.(5)

Corollary 1 instructs us to remove from the product of Eq. (15) those factors that quantify how the intervened variables (members of set X) are influenced by their pre-intervention parents. This removal follows from the fact that the post-intervention model is Markovian as well, hence, following Theorem 1, it must generate a distribution that is factorized according to the modified graph, yielding the truncated product of Corollary 1. In our example of Fig. 2(b), the distribution \(P(z, y|do(x_0))\) associated with the modified model is given by

\[P(z, y|do(x_0)) = P(z)\, P(y|x_0)\]

where \(P(z)\) and \(P(y|x_0)\) are identical to those associated with the pre-intervention distribution of Eq. (16). As expected, the distribution of Z is not affected by the intervention, since

\[P(z|do(x_0)) = \sum_y P(z, y|do(x_0)) = \sum_y P(z)\, P(y|x_0) = P(z)\]

while that of Y is sensitive to x0, and is given by

\[P(y|do(x_0)) = \sum_z P(z, y|do(x_0)) = \sum_z P(z)\, P(y|x_0) = P(y|x_0)\]

This example demonstrates how the (causal) assumptions embedded in the model M permit us to predict the post-intervention distribution from the pre-intervention distribution, which further permits us to estimate the causal effect of X on Y from nonexperimental data, since \(P(y|x_0)\) is estimable from such data. Note that we have made no assumption whatsoever on the form of the equations or the distribution of the error terms; it is the structure of the graph alone (specifically, the identity of X’s parents) that permits the derivation to go through.

5: A simple proof of the Causal Markov Theorem is given in Pearl (2000a, p. 30). This theorem was first presented in Pearl and Verma (1991), but it is implicit in the works of Kiiveri, Speed, and Carlin (1984) and others. Corollary 1 was named the “Manipulation Theorem” in Spirtes et al. (1993), and is also implicit in Robins’ (1987) G-computation formula. See Lauritzen (2001).

The truncated factorization formula enables us to derive causal quantities directly, without dealing with equations or equation modification as in Eqs. (11)–(13). Consider, for example, the model shown in Fig. 3, in which the error variables are kept implicit.

Figure 3: Markovian model illustrating the derivation of the causal effect of X on Y, Eq. (20). Error terms are not shown explicitly.

Instead of writing down the corresponding five nonparametric equations, we can write the joint distribution directly as

\[P(x, z_1, z_2, z_3, y) = P(z_1)\, P(z_2)\, P(z_3|z_1, z_2)\, P(x|z_1, z_3)\, P(y|z_2, z_3, x) \tag{18}\]

where each marginal or conditional probability on the right hand side is directly estimable from the data. Now suppose we intervene and set variable X to x0. The post-intervention distribution can readily be written (using the truncated factorization formula (17)) as

\[P(z_1, z_2, z_3, y|do(x_0)) = P(z_1)\, P(z_2)\, P(z_3|z_1, z_2)\, P(y|z_2, z_3, x_0) \tag{19}\]

and the causal effect of X on Y can be obtained immediately by marginalizing over the Z variables, giving

\[P(y|do(x_0)) = \sum_{z_1, z_2, z_3} P(z_1)\, P(z_2)\, P(z_3|z_1, z_2)\, P(y|z_2, z_3, x_0) \tag{20}\]

Note that this formula corresponds precisely to what is commonly called “adjusting for Z1, Z2 and Z3” and, moreover, we can write down this formula by inspection, without thinking about whether Z1, Z2 and Z3 are confounders, whether they lie on the causal pathways, and so on. Though such questions can be answered explicitly from the topology of the graph, they are dealt with automatically when we write down the truncated factorization formula and marginalize.
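The arithmetic of Eq. (20) can be spelled out mechanically. The sketch below is mine, assuming all five variables are binary and that a joint probability table P(x, z1, z2, z3, y) is available; the table entries here are arbitrary placeholders, whereas in practice each factor would be estimated from data generated by the model of Fig. 3. It multiplies the factors that survive the truncation and marginalizes over Z1, Z2, Z3.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Placeholder joint table P[x, z1, z2, z3, y] over binary variables (axes 0..4).
P = rng.random((2, 2, 2, 2, 2))
P /= P.sum()

def marginal(keep):
    """Marginal of P over the axes listed in `keep` (0:x, 1:z1, 2:z2, 3:z3, 4:y)."""
    drop = tuple(i for i in range(5) if i not in keep)
    return P.sum(axis=drop)

def p_y_do_x(y, x0):
    """Eq. (20): sum over z1,z2,z3 of P(z1) P(z2) P(z3|z1,z2) P(y|z2,z3,x0)."""
    Pz1, Pz2 = marginal((1,)), marginal((2,))
    Pz1z2z3 = marginal((1, 2, 3))        # axes ordered (z1, z2, z3)
    Pxz2z3y = marginal((0, 2, 3, 4))     # axes ordered (x, z2, z3, y)
    total = 0.0
    for z1, z2, z3 in product((0, 1), repeat=3):
        p_z3 = Pz1z2z3[z1, z2, z3] / Pz1z2z3[z1, z2, :].sum()
        p_y = Pxz2z3y[x0, z2, z3, y] / Pxz2z3y[x0, z2, z3, :].sum()
        total += Pz1[z1] * Pz2[z2] * p_z3 * p_y
    return total

print(p_y_do_x(y=1, x0=1))
```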

Note also that the truncated factorization formula is not restricted to interventions on a single variable; it is applicable to simultaneous or sequential interventions such as those invoked in the analysis of time-varying treatment with time-varying confounders (Robins, 1986, Arjas and Parner, 2004). For example, if X and Z2 are both treatment variables, and Z1 and Z3 are measured covariates, then the post-intervention distribution would be

\[P(z_1, z_3, y|do(x), do(z_2)) = P(z_1)\, P(z_3|z_1, z_2)\, P(y|z_2, z_3, x) \tag{21}\]

and the causal effect of the treatment sequence do(X = x), do(Z2 = z2)(6) would be

\[P(y|do(x), do(z_2)) = \sum_{z_1, z_3} P(z_1)\, P(z_3|z_1, z_2)\, P(y|z_2, z_3, x) \tag{22}\]

This expression coincides with Robins’ (1987) G-computation formula, which was derived from a more complicated set of (counterfactual) assumptions.

As noted by Robins, the formula dictates an adjustment for covariates (e.g., Z3) that might be affected by previous treatments (e.g., Z2).

3.3 Coping with unmeasured confounders

Things are more complicated when we face unmeasured confounders. For example, it is not immediately clear whether the formula in Eq. (20) can be estimated if any of Z1, Z2 and Z3 is not measured. A few but challenging algebraic steps would reveal that one can perform the summation over Z2 to obtain

\[P(y|do(x_0)) = \sum_{z_1, z_3} P(z_1)\, P(z_3|z_1)\, P(y|z_1, z_3, x_0) \tag{23}\]

which means that we need only adjust for Z1 and Z3 without ever measuring Z2. In general, it can be shown (Pearl, 2000a, p. 73) that, whenever the graph is Markovian, the post-interventional distribution \(P(Y = y|do(X = x))\) is given by the following expression:

\[P(Y = y|do(X = x)) = \sum_t P(y|t, x)\, P(t) \tag{24}\]

where T is the set of direct causes of X (also called “parents”) in the graph. This allows us to write (23) directly from the graph, thus skipping the algebra that led to (23). It further implies that, no matter how complicated the model, the parents of X are the only variables that need to be measured to estimate the causal effects of X.

6: For clarity, we drop the (superfluous) subscript 0 from \(x_0\) and \(z_{20}\).

It is not immediately clear, however, whether other sets of variables beside X’s parents suffice for estimating the effect of X, whether some algebraic manipulation can further reduce Eq. (23), or whether measurement of Z3 (unlike Z1, or Z2) is necessary in any estimation of \(P(y|do(x_0))\). Such considerations become transparent from a graphical criterion to be discussed next.

3.3.1 Covariate selection – the back-door criterion

Consider an observational study where we wish to find the effect of X on Y, for example, treatment on response, and assume that the factors deemed relevant to the problem are structured as in Fig. 4; some are affecting the response, some are affecting the treatment, and some are affecting both treatment and response. Some of these factors may be unmeasurable, such as genetic trait or life style; others are measurable, such as gender, age, and salary level. Our problem is to select a subset of these factors for measurement and adjustment such that, if we compare treated vs. untreated subjects having the same values of the selected factors, we get the correct treatment effect in that subpopulation of subjects. Such a set of factors is called a “sufficient set” or “admissible set” for adjustment. The problem of defining an admissible set, let alone finding one, has baffled epidemiologists and social scientists for decades (see (Greenland et al., 1999, Pearl, 1998) for review).

Figure 4: Markovian model illustrating the back-door criterion. Error terms are not shown explicitly.

The following criterion, named “back-door” in (Pearl, 1993a), settles this problem by providing a graphical method of selecting admissible sets of factors for adjustment.

Definition 3 (Admissible sets – the back-door criterion) A set S is admissible (or “sufficient”) for adjustment if two conditions hold:
  1. No element of S is a descendant of X.
  2. The elements of S “block” all “back-door” paths from X to Y, namely all paths that end with an arrow pointing to X.

In this criterion, “blocking” is interpreted as in Definition 1. For example, the set S = {Z3} blocks the path X ← W1 ← Z1 → Z3 → Y, because the arrow-emitting node Z3 is in S. However, the set S = {Z3} does not block the path X ← W1 ← Z1 → Z3 ← Z2 → W2 → Y, because none of the arrow-emitting nodes, Z1 and Z2, is in S, and the collision node Z3 is not outside S.

Based on this criterion we see, for example, that the sets {Z1, Z2, Z3}, {Z1, Z3}, {W1, Z3}, and {W2, Z3} are each sufficient for adjustment, because each blocks all back-door paths between X and Y. The set {Z3}, however, is not sufficient for adjustment because, as explained above, it does not block the path X ← W1 ← Z1 → Z3 ← Z2 → W2 → Y.

The intuition behind the back-door criterion is as follows. The back-door paths in the diagram carry spurious associations from X to Y, while the paths directed along the arrows from X to Y carry causative associations. Blocking the former paths (by conditioning on S) ensures that the measured association between X and Y is purely causative, namely, it correctly represents the target quantity: the causal effect of X on Y. The reason for excluding descendants of X (e.g., W3 or any of its descendants) is given in (Pearl, 2009b, pp. 338–41).

Formally, the implication of finding an admissible set S is that stratifying on S is guaranteed to remove all confounding bias relative to the causal effect of X on Y. In other words, the risk difference in each stratum of S gives the correct causal effect in that stratum. In the binary case, for example, the risk difference in stratum s of S is given by

\[P(Y = 1|X = 1, S = s) - P(Y = 1|X = 0, S = s)\]

while the causal effect (of X on Y) at that stratum is given by

\[P(Y = 1|do(X = 1), S = s) - P(Y = 1|do(X = 0), S = s).\]

These two expressions are guaranteed to be equal whenever S is a sufficient set, such as {Z1, Z3} or {Z2, Z3} in Fig. 4. Likewise, the average stratified risk difference, taken over all strata,

\[\sum_s [P(Y = 1|X = 1, S = s) - P(Y = 1|X = 0, S = s)]\, P(S = s),\]

gives the correct causal effect of X on Y in the entire population,

\[P(Y = 1|do(X = 1)) - P(Y = 1|do(X = 0)).\]

In general, for multi-valued variables X and Y, finding a sufficient set S permits us to write

\[P(Y = y|do(X = x), S = s) = P(Y = y|X = x, S = s)\]

and

\[P(Y = y|do(X = x)) = \sum_s P(Y = y|X = x, S = s)\, P(S = s) \tag{25}\]

Since all factors on the right hand side of the equation are estimable (e.g., by regression) from the pre-interventional data, the causal effect can likewise be estimated from such data without bias.

An equivalent expression for the causal effect (25) can be obtained by multiplying and dividing by the conditional probability \(P(X = x|S = s)\), giving

\[P(Y = y|do(X = x)) = \sum_s \frac{P(Y = y, X = x, S = s)}{P(X = x|S = s)} \tag{26}\]

from which the name “Inverse Probability Weighting” has evolved (Pearl, 2000a, pp. 73, 95).
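To show how Eqs. (25) and (26) translate into finite-sample estimators, here is a rough sketch (my own, with simulated data in which a discrete covariate S is admissible); the stratified adjustment and the inverse-probability-weighted sum are two ways of evaluating the same estimand and agree up to sampling noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data in which S is admissible (S affects X and Y; X affects Y).
s = rng.integers(0, 3, n)
x = rng.binomial(1, 0.2 + 0.2 * s)
y = rng.binomial(1, 0.1 + 0.15 * s + 0.3 * x)
df = pd.DataFrame({"s": s, "x": x, "y": y})

def adjusted(df, x0):
    """Eq. (25): sum_s P(Y = 1 | X = x0, S = s) P(S = s)."""
    p_s = df["s"].value_counts(normalize=True)
    p_y_xs = df[df["x"] == x0].groupby("s")["y"].mean()
    return float((p_y_xs * p_s).sum())

def ipw(df, x0):
    """Eq. (26): inverse probability weighting by P(X = x | S = s)."""
    p_x1_s = df.groupby("s")["x"].mean()                  # P(X = 1 | S = s)
    prop = df["s"].map(p_x1_s).to_numpy()
    w = np.where(df["x"] == 1, prop, 1 - prop)            # P(X = x_i | S = s_i)
    sel = (df["x"] == x0).to_numpy()
    return float((df["y"].to_numpy()[sel] / w[sel]).sum() / len(df))

print(adjusted(df, 1) - adjusted(df, 0))   # approximately 0.3, the true effect
print(ipw(df, 1) - ipw(df, 0))             # numerically close to the line above
```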

Interestingly, it can be shown that any irreducible sufficient set, S, taken as a unit, satisfies the associational criterion that epidemiologists have been using to define “confounders.” In other words, S must be associated with X and, simultaneously, associated with Y, given X. This need not hold for any specific members of S. For example, the variable Z3 in Fig. 4, though it is a member of every sufficient set and hence a confounder, can be unassociated with both Y and X (Pearl, 2000a, p. 195). Conversely, a pre-treatment variable Z that is associated with both Y and X may need to be excluded from entering a sufficient set.

The back-door criterion allows us to write Eq. (25) directly, by selecting a sufficient set S directly from the diagram, without manipulating the truncated factorization formula. The selection criterion can be applied systematically to diagrams of any size and shape, thus freeing analysts from judging whether “X is conditionally ignorable given S,” a formidable mental task required in the potential-response framework (Rosenbaum and Rubin, 1983). The criterion also enables the analyst to search for an optimal set of covariates, namely, a set S that minimizes measurement cost or sampling variability (Tian, Paz, and Pearl, 1998).

All in all, one can safely state that, armed with the back-door criterion, causality has removed “confounding” from its store of enigmatic and controversial concepts.

3.3.2 Confounding equivalence – a graphical test

Another problem that has been given a graphical solution recently is that of determining whether adjustment for two sets of covariates would result in the same confounding bias (Pearl and Paz, 2009). The reasons for posing this question are several. First, an investigator may wish to assess, prior to taking any measurement, whether two candidate sets of covariates, differing substantially in dimensionality, measurement error, cost, or sample variability, are equally valuable in their bias-reduction potential. Second, assuming that the structure of the underlying DAG is only partially known, one may wish to test, using adjustment, which of two hypothesized structures is compatible with the data. Structures that predict equal response to adjustment for two sets of variables must be rejected if, after adjustment, such equality is not found in the data.

Definition 4 (c-equivalence) Define two sets, T and Z, of covariates as c-equivalent (c connotes “confounding”) if the following equality holds:

\[\sum_t P(y|x, t)\, P(t) = \sum_z P(y|x, z)\, P(z) \quad \forall x, y \tag{27}\]

Definition 5 (Markov boundary) For any set of variables S in a DAG G, the Markov boundary \(S_m\) of S is the minimal subset of S that d-separates X from all other members of S.

In Fig. 4, for example, the Markov boundary of S = {W1, Z1, Z2, Z3} is \(S_m\) = {W1, Z3}.

Theorem 2 (Pearl and Paz, 2009) Let Z and T be two sets of variables in G, containing no descendant of X. A necessary and sufficient condition for Z and T to be c-equivalent is that at least one of the following conditions holds:
  1. \(Z_m = T_m\) (i.e., the Markov boundary of Z coincides with that of T);
  2. Z and T are admissible (i.e., satisfy the back-door condition).

For example, the sets T = {W1, Z3} and Z = {Z3, W2} in Fig. 4 are c-equivalent, because each blocks all back-door paths from X to Y. Similarly, the non-admissible sets T = {Z2} and Z = {W2, Z2} are c-equivalent, since their Markov boundaries are the same (\(T_m = Z_m\) = {Z2}). In contrast, the sets {W1} and {Z1}, although they block the same set of paths in the graph, are not c-equivalent; they fail both conditions of Theorem 2.

Tests for c-equivalence (27) are fairly easy to perform, and they can also be assisted by propensity score methods. The information that such tests provide can be as powerful as conditional independence tests. The statistical ramifications of such tests are explicated in (Pearl and Paz, 2009).
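In practice, an empirical c-equivalence check of Eq. (27) amounts to comparing two adjusted estimands. The sketch below is my own illustration, assuming discrete covariates held in a pandas DataFrame with hypothetical columns x and y, and with T and Z given as lists of covariate column names; it is not the testing procedure of Pearl and Paz (2009), only a naive plug-in comparison.

```python
import pandas as pd

def adjusted_estimand(df, covars, x0, y0):
    """sum_s P(Y = y0 | X = x0, S = s) P(S = s), for the covariate columns `covars`."""
    p_s = df.groupby(covars).size() / len(df)
    p_y = df[df["x"] == x0].groupby(covars)["y"].apply(lambda col: (col == y0).mean())
    # Strata with no X = x0 cases are simply skipped here (a crude choice for a sketch).
    return float((p_y * p_s).dropna().sum())

def c_equivalent(df, T, Z, tol=0.02):
    """Empirical check of Eq. (27): do covariate sets T and Z give the same estimand?"""
    return all(
        abs(adjusted_estimand(df, T, x0, y0) - adjusted_estimand(df, Z, x0, y0)) < tol
        for x0 in df["x"].unique()
        for y0 in df["y"].unique()
    )
```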

3.3.3 General control of confounding

Adjusting for covariates is only one of many methods that permit us to estimate causal effects in nonexperimental studies. Pearl (1995) has presented examples in which there exists no set of variables that is sufficient for adjustment and where the causal effect can nevertheless be estimated consistently. The estimation, in such cases, employs multi-stage adjustments. For example, if W3 is the only observed covariate in the model of Fig. 4, then there exists no sufficient set for adjustment (because no set of observed covariates can block the paths from X to Y through Z3), yet \(P(y|do(x))\) can be estimated in two steps; first we estimate \(P(w_3|do(x)) = P(w_3|x)\) (by virtue of the fact that there exists no unblocked back-door path from X to W3), second we estimate \(P(y|do(w_3))\) (since X constitutes a sufficient set for the effect of W3 on Y) and, finally, we combine the two effects together and obtain

\[P(y|do(x)) = \sum_{w_3} P(w_3|do(x))\, P(y|do(w_3)) \tag{28}\]

In this example, the variable W3 acts as a “mediating instrumental variable” (Pearl, 1993b, Chalak and White, 2006).
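A plug-in version of Eq. (28) can be written directly, since \(P(w_3|do(x)) = P(w_3|x)\) and \(P(y|do(w_3))\) is obtained by adjusting for X (Eq. (25) with S = {X}). The sketch below is my own illustration, assuming discrete variables held in a pandas DataFrame with hypothetical columns x, w3, and y; composing the two stages in this way is essentially the familiar front-door-type estimate.

```python
import pandas as pd

def two_stage_effect(df, x0, y0):
    """Plug-in estimate of Eq. (28): P(y0 | do(x0)) = sum_w P(w|x0) * P(y0 | do(w))."""
    p_x = df["x"].value_counts(normalize=True)
    total = 0.0
    for w in df["w3"].unique():
        # First stage: P(w3 = w | do(x0)) = P(w3 = w | x0); no back-door path to W3.
        p_w_x0 = ((df["w3"] == w) & (df["x"] == x0)).mean() / (df["x"] == x0).mean()
        # Second stage: P(y0 | do(w3 = w)) by adjusting for X, a sufficient set here.
        p_y_do_w = sum(
            ((df["y"] == y0) & (df["w3"] == w) & (df["x"] == xp)).mean()
            / ((df["w3"] == w) & (df["x"] == xp)).mean()
            * p_x[xp]
            for xp in p_x.index
        )
        total += p_w_x0 * p_y_do_w
    return total
```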

The analysis used in the derivation and validation of such results invokes mathematical rules for transforming causal quantities, represented by expressions such as \(P(Y = y|do(x))\), into do-free expressions derivable from \(P(z, x, y)\), since only do-free expressions are estimable from non-experimental data. When such a transformation is feasible, we are ensured that the causal quantity is identifiable.

Applications of this calculus to problems involving multiple interventions (e.g., time-varying treatments), conditional policies, and surrogate experiments were developed in Pearl and Robins (1995), Kuroki and Miyakawa (1999), and Pearl (2000a, Chapters 3–4).

A more recent analysis (Tian and Pearl, 2002) shows that the key to identifiability lies not in blocking paths between X and Y but, rather, in blocking paths between X and its immediate successors on the pathways to Y. All existing criteria for identification are special cases of the one defined in the following theorem:

Theorem 3 (Tian and Pearl, 2002) A sufficient condition for identifying the causal effect \(P(y|do(x))\) is that every path between X and any of its children traces at least one arrow emanating from a measured variable.(7)

For example, if W3 is the only observed covariate in the model of Fig. 4, \(P(y|do(x))\) can be estimated since every path from X to W3 (the only child of X) traces either the arrow X → W3, or the arrow W3 → Y, both emanating from a measured variable (W3).

Shpitser and Pearl (2006) have further extended this theorem by (1) presenting a necessary and sufficient condition for identification, and (2) extending the condition from causal effects to any counterfactual expression. The corresponding unbiased estimands for these causal quantities are readable directly from the diagram.

Graph-based methods for effect identification under measurement errors are discussed in (Pearl, 2009f, Hernán and Cole, 2009, Cai and Kuroki, 2008).

3.3.4 From identification to estimation

The mathematical derivation of causal effect estimands, like Eqs. (25) and (28), is merely a first step toward computing quantitative estimates of those effects from finite samples, using the rich traditions of statistical estimation and machine learning, Bayesian as well as non-Bayesian. Although the estimands derived in (25) and (28) are non-parametric, this does not mean that one should refrain from using parametric forms in the estimation phase of the study. Parameterization is in fact necessary when the dimensionality of a problem is high. For example, if the assumptions of Gaussian, zero-mean disturbances and additive interactions are deemed reasonable, then the estimand given in (28) can be converted to the product \(E(Y|do(x)) = r_{W_3 X}\, r_{Y W_3 \cdot X}\, x\), where \(r_{YZ \cdot X}\) is the (standardized) coefficient of Z in the regression of Y on Z and X. More sophisticated estimation techniques are the “marginal structural models” of (Robins, 1999), and the “propensity score” method of (Rosenbaum and Rubin, 1983), which were found to be particularly useful when dimensionality is high and data are sparse (see Pearl (2009b, pp. 348–52)).

It should be emphasized, however, that contrary to conventional wisdom (e.g., (Rubin, 2007, 2009)), propensity score methods are merely efficient estimators of the right hand side of (25); they entail the same asymptotic bias, and cannot be expected to reduce bias in case the set S does not satisfy the back-door criterion (Pearl, 2000a, 2009c,d). Consequently, the prevailing practice of conditioning on as many pre-treatment measurements as possible should be approached with great caution; some covariates (e.g., Z3 in Fig. 3) may actually increase bias if included in the analysis (see footnote 16). Using simulation and parametric analysis, Heckman and Navarro-Lozano (2004) and Wooldridge (2009) indeed confirmed the bias-raising potential of certain covariates in propensity-score methods. The graphical tools presented in this section unveil the character of these covariates and show precisely what covariates should, and should not, be included in the conditioning set for propensity-score matching (see also (Pearl and Paz, 2009, Pearl, 2009e)).

7: Before applying this criterion, one may delete from the causal graph all nodes that are not ancestors of Y.

3.4 Counterfactual analysis in structural models

Not all questions of causal character can be encoded in \(P(y|do(x))\) type expressions, thus implying that not all causal questions can be answered from experimental studies. For example, questions of attribution (e.g., what fraction of death cases are due to a specific exposure?) or of susceptibility (what fraction of the healthy unexposed population would have gotten the disease had they been exposed?) cannot be answered from experimental studies, and naturally, this kind of question cannot be expressed in \(P(y|do(x))\) notation.(8) To answer such questions, a probabilistic analysis of counterfactuals is required, one dedicated to the relation “Y would be y had X been x in situation U = u,” denoted \(Y_x(u) = y\). Remarkably, unknown to most economists and philosophers, structural equation models provide the formal interpretation and symbolic machinery for analyzing such counterfactual relationships.

The key idea is to interpret the phrase “had X been x” as an instruction to make a minimal modification in the current model, which may have assigned X a different value, say X = x′, so as to ensure the specified condition X = x. Such a minimal modification amounts to replacing the equation for X by a constant x, as we have done in Eq. (6). This replacement permits the constant x to differ from the actual value of X (namely \(f_X(z, u_X)\)) without rendering the system of equations inconsistent, thus yielding a formal interpretation of counterfactuals in multi-stage models, where the dependent variable in one equation may be an independent variable in another.(9)

8: The reason for this fundamental limitation is that no death case can be tested twice, with and without treatment. For example, if we measure equal proportions of deaths in the treatment and control groups, we cannot tell how many death cases are actually attributable to the treatment itself; it is quite possible that many of those who died under treatment would be alive if untreated and, simultaneously, many of those who survived with treatment would have died if not treated.

9: Connections between structural equations and a restricted class of counterfactuals were first recognized by Simon and Rescher (1966). These were later generalized by Balke and Pearl (1995), using surgeries (Eq. (29)), thus permitting endogenous variables to serve as counterfactual antecedents. The term “surgery definition” was used in Pearl (2000a, Epilogue) and criticized by Cartwright (2007) and Heckman (2005) (see Pearl (2009b, pp. 362–3, 374–9) for rebuttals).

Definition 6 (Unit-level Counterfactuals – "surgical" definition, Pearl (2000a, p. 98)) Let M be a structural model and \(M_x\) a modified version of M, with the equation(s) of X replaced by X = x. Denote the solution for Y in the equations of \(M_x\) by the symbol \(Y_{M_x}(u)\). The counterfactual \(Y_x(u)\) (read: "the value of Y in unit u, had X been x") is given by:

\[ Y_x(u) \triangleq Y_{M_x}(u). \qquad (29) \]

In words: The counterfactual \(Y_x(u)\) in model M is defined as the solution for Y in the "surgically modified" submodel \(M_x\).

We see that the unit-level counterfactual \(Y_x(u)\), which in the Neyman-Rubin approach is treated as a primitive, undefined quantity, is actually a derived quantity in the structural framework. The fact that we equate the experimental unit u with a vector of background conditions, U = u, in M, reflects the understanding that the name of a unit or its identity do not matter; it is only the vector U = u of attributes characterizing a unit which determines its behavior or response. As we go from one unit to another, the laws of nature, as they are reflected in the functions \(f_X\), \(f_Y\), etc., remain invariant; only the attributes U = u vary from individual to individual.(10)

To illustrate, consider the solution of Y in the modified model \(M_{x_0}\) of Eq. (6), which Definition 6 endows with the symbol \(Y_{x_0}(u_X, u_Y, u_Z)\). This entity has a clear counterfactual interpretation, for it stands for the way an individual with characteristics \((u_X, u_Y, u_Z)\) would respond had the treatment been \(x_0\), rather than the treatment \(x = f_X(z, u_X)\) actually received by that individual. In our example, since Y does not depend on \(u_X\) and \(u_Z\), we can write:

\[ Y_{x_0}(u) = Y_{x_0}(u_Y, u_X, u_Z) = f_Y(x_0, u_Y). \qquad (30) \]

In a similar fashion, we can derive

\[ Y_{z_0}(u) = f_Y(f_X(z_0, u_X), u_Y), \quad X_{z_0, y_0}(u) = f_X(z_0, u_X), \]

and so on. These examples reveal the counterfactual reading of each individual structural equation in the model of Eq. (5). The equation \(x = f_X(z, u_X)\), for example, advertises the empirical claim that, regardless of the values taken by other variables in the system, had Z been \(z_0\), X would take on no other value but \(x = f_X(z_0, u_X)\).

10: The distinction between general, or population-level causes (e.g., "Drinking hemlock causes death") and singular or unit-level causes (e.g., "Socrates' drinking hemlock caused his death"), which many philosophers have regarded as irreconcilable (Eells, 1991), introduces no tension at all in the structural theory. The two types of sentences differ merely in the level of situation-specific information that is brought to bear on a problem, that is, in the specificity of the evidence e that enters the quantity \(P(Y_x = y \mid e)\). When e includes all factors u, we have a deterministic, unit-level causation on our hands; when e contains only a few known attributes (e.g., age, income, occupation, etc.) while others are assigned probabilities, a population-level analysis ensues.

Clearly, the distribution \(P(u_Y, u_X, u_Z)\) induces a well-defined probability on the counterfactual event \(Y_{x_0} = y\), as well as on joint counterfactual events, such as '\(Y_{x_0} = y\) AND \(Y_{x_1} = y'\),' which are, in principle, unobservable if \(x_0 \neq x_1\). Thus, to answer attributional questions, such as whether Y would be \(y_1\) if X were \(x_1\), given that in fact Y is \(y_0\) and X is \(x_0\), we need to compute the conditional probability \(P(Y_{x_1} = y_1 \mid Y = y_0, X = x_0)\), which is well defined once we know the forms of the structural equations and the distribution of the exogenous variables in the model.

For example, assuming linear equations (as in Fig. 1),

\[ x = u_X, \qquad y = \beta x + u_Y, \]

the conditioning events \(Y = y_0\) and \(X = x_0\) yield \(U_X = x_0\) and \(U_Y = y_0 - \beta x_0\), and we can conclude that, with probability one, \(Y_{x_1}\) must take on the value \(Y_{x_1} = \beta x_1 + U_Y = \beta(x_1 - x_0) + y_0\). In other words, if X were \(x_1\) instead of \(x_0\), Y would increase by β times the difference \((x_1 - x_0)\). In nonlinear systems, the result would also depend on the distribution of \(\{U_X, U_Y\}\) and, for that reason, attributional queries are generally not identifiable in nonparametric models (see Section 6.3 and Pearl (2000a, Chapter 9)).
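To make the abduction, action, and prediction steps of this computation concrete, the following minimal Python sketch evaluates the counterfactual \(Y_{x_1}\) for the linear model above. All numerical values (β, the evidence, and the hypothetical antecedent) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a unit-level counterfactual in the linear model
# x = u_X, y = beta*x + u_Y, following the "surgical" definition (29).
# All numbers below are illustrative assumptions.

beta = 0.7           # assumed structural coefficient
x0, y0 = 2.0, 3.1    # observed evidence: X = x0, Y = y0
x1 = 5.0             # hypothetical antecedent "had X been x1"

# Step 1 (abduction): infer the background variables from the evidence.
u_X = x0
u_Y = y0 - beta * x0

# Step 2 (action): replace the equation for X by the constant x1 (submodel M_{x1}).
# Step 3 (prediction): solve the modified model for Y.
y_x1 = beta * x1 + u_Y

print(y_x1)                       # counterfactual value Y_{x1}(u)
print(beta * (x1 - x0) + y0)      # agrees with beta*(x1 - x0) + y0, as in the text
```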

In general, if x and x′ are incompatible then \(Y_x\) and \(Y_{x'}\) cannot be measured simultaneously, and it may seem meaningless to attribute probability to the joint statement "Y would be y if X = x and Y would be y′ if X = x′."(11) Such concerns have been a source of objections to treating counterfactuals as jointly distributed random variables (Dawid, 2000). The definition of \(Y_x\) and \(Y_{x'}\) in terms of two distinct submodels neutralizes these objections (Pearl, 2000b), since the contradictory joint statement is mapped into an ordinary event, one where the background variables satisfy both statements simultaneously, each in its own distinct submodel; such events have well-defined probabilities.

The surgical definition of counterfactuals given by (29) provides the conceptual and formal basis for the Neyman-Rubin potential-outcome framework, an approach to causation that takes a controlled randomized trial (CRT) as its ruling paradigm, assuming that nothing is known to the experimenter about the science behind the data. This "black-box" approach, which has thus far been denied the benefits of graphical or structural analyses, was developed by statisticians who found it difficult to cross the two mental barriers discussed in Section 2.2. Section 5 establishes the precise relationship between the structural and potential-outcome paradigms, and outlines how the latter can benefit from the richer representational power of the former.

11: For example, "The probability is 80% that Joe belongs to the class of patients who will be cured if they take the drug and die otherwise."

4. Methodological Principles of Causal Inference

The structural theory described in the previous sections dictates a principled methodology that eliminates much of the confusion concerning the interpretations of study results as well as the ethical dilemmas that this confusion tends to spawn. The methodology dictates that every investigation involving causal relationships (and this entails the vast majority of empirical studies in the health, social, and behavioral sciences) should be structured along the following four-step process:

  1. Define: Express the target quantity Q as a function Q(M) that can be computed from any model M.

  2. Assume: Formulate causal assumptions using ordinary scientific language and represent their structural part in graphical form.

  3. Identify: Determine if the target quantity is identifiable (i.e., expressible in terms of estimable parameters).

  4. Estimate: Estimate the target quantity if it is identifiable, or approximate it, if it is not. Test the statistical implications of the model, if any, and modify the model when failure occurs.

4.1 Defining the target quantity

The definitional phase is the most neglected step in current practice of quantitative analysis. The structural modeling approach insists on defining the target quantity, be it "causal effect," "mediated effect," "effect on the treated," or "probability of causation," before specifying any aspect of the model, without making functional or distributional assumptions and prior to choosing a method of estimation.

The investigator should view this definition as an algorithm that receives a model M as an input and delivers the desired quantity Q(M) as the output. Surely, such an algorithm should not be tailored to any aspect of the input M; it should be general, and ready to accommodate any conceivable model M whatsoever. Moreover, the investigator should imagine that the input M is a completely specified model, with all the functions \(f_X, f_Y, \ldots\) and all the U variables (or their associated probabilities) given precisely. This is the hardest step for statistically trained investigators to make; knowing in advance that such model details will never be estimable from the data, the definition of Q(M) appears like a futile exercise in fantasyland – it is not.

For example, the formal definition of the causal effect P(y|do(x)), as given in Eq. (7), is universally applicable to all models, parametric as well as nonparametric, through the formation of a submodel \(M_x\). By defining causal effect procedurally, thus divorcing it from its traditional parametric representation, the structural theory avoids the many pitfalls and confusions that have plagued the interpretation of structural and regressional parameters for the past half century.(12)

4.2 Explicating causal assumptions

This is the second most neglected step in causal analysis. In the past, the difficulty has been the lack of a language suitable for articulating causal assumptions which, aside from impeding investigators from explicating assumptions, also inhibited them from giving causal interpretations to their findings.

Structural equation models, in their counterfactual reading, have removed this lingering difficulty by providing the needed language for causal analysis. Figures 3 and 4 illustrate the graphical component of this language, where assumptions are conveyed through the missing arrows in the diagram. If numerical or functional knowledge is available, for example, linearity or monotonicity of the functions \(f_X, f_Y, \ldots\), those are stated separately, and applied in the identification and estimation phases of the study. Today we understand that the longevity and natural appeal of structural equations stem from the fact that they permit investigators to communicate causal assumptions formally and in the very same vocabulary in which scientific knowledge is stored.

Unfortunately, however, this understanding is not shared by all causal analysts; some analysts vehemently oppose the re-emergence of structure-based causation and insist, instead, on articulating causal assumptions exclusively in the unnatural (though formally equivalent) language of "potential outcomes," "ignorability," "missing data," "treatment assignment," and other metaphors borrowed from clinical trials. This modern assault on structural models is perhaps more dangerous than the regressional invasion that distorted the causal readings of these models in the late 1970s (Richard, 1980). While sanctioning causal inference in one idiosyncratic style of analysis, the modern assault denies validity to any other style, including structural equations, thus discouraging investigators from subjecting models to the scrutiny of scientific knowledge.

12: Note that β in Eq. (1), the incremental causal effect of X on Y, is defined procedurally by
\[ \beta \triangleq E(Y \mid do(x_0 + 1)) - E(Y \mid do(x_0)) = \frac{\partial}{\partial x} E(Y \mid do(x)) = \frac{\partial}{\partial x} E(Y_x). \]
Naturally, all attempts to give β a statistical interpretation have ended in frustrations (Holland, 1988, Whittaker, 1990, Wermuth, 1992, Wermuth and Cox, 1993), some persisting well into the 21st century (Sobel, 2008).

This exclusivist attitude is manifested in passages such as: "The crucial idea is to set up the causal inference problem as one of missing data" or "If a problem of causal inference cannot be formulated in this manner (as the comparison of potential outcomes under different treatment assignments), it is not a problem of inference for causal effects, and the use of "causal" should be avoided," or, even more bluntly, "the underlying assumptions needed to justify any causal conclusions should be carefully and explicitly argued, not in terms of technical properties like "uncorrelated error terms," but in terms of real world properties, such as how the units received the different treatments" (Wilkinson, the Task Force on Statistical Inference, and APA Board of Scientific Affairs, 1999).

The methodology expounded in this paper testifies against such restrictions. It demonstrates the viability and scientific soundness of the traditional structural equations paradigm, which stands diametrically opposed to the "missing data" paradigm. It renders the vocabulary of "treatment assignment" stifling and irrelevant (e.g., there is no "treatment assignment" in sex discrimination cases). Most importantly, it strongly prefers the use of "uncorrelated error terms" (or "omitted factors") over its "strong ignorability" alternative, as the proper way of articulating causal assumptions. Even the most devout advocates of the "strong ignorability" language use "omitted factors" when the need arises to defend assumptions (e.g., Sobel (2008)).

4.3 Identification, estimation, and approximation

Having unburdened itself from parametric representations, the identification process in the structural framework proceeds either in the space of assumptions (i.e., the diagram) or in the space of mathematical expressions, after translating the graphical assumptions into a counterfactual language, as demonstrated in Section 5.3. Graphical criteria such as those of Definition 3 and Theorem 3 permit the identification of causal effects to be decided entirely within the graphical domain, where it can benefit from the guidance of scientific understanding. Identification of counterfactual queries, on the other hand, often requires a symbiosis of both algebraic and graphical techniques. The nonparametric nature of the identification task (Definition 1) makes it clear that, contrary to traditional folklore in linear analysis, it is not the model that need be identified but the query Q – the target of investigation. It also provides a simple way of proving non-identifiability: the construction of two parameterizations of M, agreeing in P and disagreeing in Q, is sufficient to rule out identifiability.

When Q is identifiable, the structural framework also delivers an algebraic expression for the estimand EST(Q) of the target quantity Q, examples of which are given in Eqs. (24) and (25), and estimation techniques are then unleashed as discussed in Section 3.3.4. An integral part of this estimation phase is a test for the testable implications, if any, of those assumptions in M that render Q identifiable – there is no point in estimating EST(Q) if the data prove those assumptions false and EST(Q) turns out to be a misrepresentation of Q. Investigators should be reminded, however, that only a fraction, called "kernel," of the assumptions embodied in M are needed for identifying Q (Pearl, 2004); the rest may be violated in the data with no effect on Q. In Fig. 2, for example, the assumption \(\{U_Z \perp\!\!\!\perp U_X\}\) is not necessary for identifying Q = P(y|do(x)); the kernel \(\{U_Y \perp\!\!\!\perp U_Z, U_Y \perp\!\!\!\perp U_X\}\) (together with the missing arrows) is sufficient. Therefore, the testable implication of this kernel, \(Z \perp\!\!\!\perp Y \mid X\), is all we need to test when our target quantity is Q; the assumption \(\{U_Z \perp\!\!\!\perp U_X\}\) need not concern us.

More importantly, investigators must keep in mind that only a tiny fraction of any kernel lends itself to statistical tests; the bulk of it must remain untestable, at the mercy of scientific judgment. In Fig. 2, for example, the assumption set \(\{U_X \perp\!\!\!\perp U_Z, U_Y \perp\!\!\!\perp U_X\}\) constitutes a sufficient kernel for Q = P(y|do(x)) (see Eq. (28)), yet it has no testable implications whatsoever. The prevailing practice of submitting an entire structural equation model to a "goodness of fit" test (Bollen, 1989) in support of causal claims is at odds with the logic of SCM (see Pearl (2000a, pp. 144–5)). Alternative causal models usually exist that make contradictory claims and, yet, possess identical statistical implications. Statistical tests can be used for rejecting certain kernels, in the rare cases where such kernels have testable implications, but the lion's share of supporting causal claims falls on the shoulders of untested causal assumptions.

When conditions for identification are not met, the best one can do is derive bounds for the quantities of interest—namely, a range of possible values of Q that represents our ignorance about the details of the data-generating process M and that cannot be improved with increasing sample size. A classical example of a non-identifiable model that has been approximated by bounds is the problem of estimating the causal effect in experimental studies marred by noncompliance, the structure of which is given in Fig. 5.

Our task in this example is to find the highest and lowest values of Q,

\[ Q \triangleq P(Y = y \mid do(x)) = \sum_{u_X} P(Y = y \mid X = x, U_X = u_X)\, P(U_X = u_X), \qquad (31) \]

subject to the equality constraints imposed by the observed probabilities P(x, y | z), where the maximization ranges over all possible functions \(P(u_Y, u_X)\), \(P(y \mid x, u_X)\), and \(P(x \mid z, u_Y)\) that satisfy those constraints.

Figure 5: Causal diagram representing the assignment (Z), treatment (X), and outcome (Y) in a clinical trial with imperfect compliance.

Realizing that units in this example fall into 16 equivalence classes, each representing a binary function x = f(z) paired with a binary function y = g(x),(13) Balke and Pearl (1997) were able to derive closed-form solutions for these bounds.

They showed that, in certain cases, the derived bounds can yield significant information on the treatment efficacy. Chickering and Pearl (1997) further used Bayesian techniques (with Gibbs sampling) to investigate the sharpness of these bounds as a function of sample size. Kaufman, Kaufman, and MacLehose (2009) used this technique to bound direct and indirect effects (see Section 6.1).
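Although Balke and Pearl (1997) derived closed-form expressions for these bounds, the underlying optimization can be sketched numerically as a linear program over the 16 equivalence classes. In the sketch below, the observed distribution P(x, y | z) is invented for illustration (generated from a hypothetical compliance/response-type mixture so that the program is feasible), and scipy's linprog is used only as a convenient solver; this is a sketch of the formulation, not the authors' implementation.

```python
# Minimal sketch of the linear-programming formulation behind bounds on
# ACE = P(Y=1|do(X=1)) - P(Y=1|do(X=0)) under noncompliance (Fig. 5).
# The observed distribution P(x, y | z) below is an illustrative assumption.
import itertools
import numpy as np
from scipy.optimize import linprog

# Response types: r = (x if z=0, x if z=1), s = (y if x=0, y if x=1).
r_types = list(itertools.product([0, 1], repeat=2))   # 4 compliance types
s_types = list(itertools.product([0, 1], repeat=2))   # 4 response types
cells = list(itertools.product(r_types, s_types))     # 16 unknown probabilities q

def exhibits(r, s, x, y, z):
    """1 if a unit of type (r, s) would show X=x, Y=y when assigned Z=z."""
    return 1.0 if (r[z] == x and s[x] == y) else 0.0

# Illustrative observed distribution P(x, y | z), keyed by (x, y, z).
P_obs = {(0, 0, 0): 0.56, (0, 1, 0): 0.24, (1, 0, 0): 0.08, (1, 1, 0): 0.12,
         (0, 0, 1): 0.21, (0, 1, 1): 0.09, (1, 0, 1): 0.28, (1, 1, 1): 0.42}

# Equality constraints: matching cells must reproduce P(x, y | z); q sums to 1.
A_eq, b_eq = [], []
for (x, y, z), p in P_obs.items():
    A_eq.append([exhibits(r, s, x, y, z) for (r, s) in cells])
    b_eq.append(p)
A_eq.append([1.0] * len(cells))
b_eq.append(1.0)

# Objective: ACE = sum over cells of q * (s[1] - s[0]).
c = np.array([s[1] - s[0] for (_, s) in cells], dtype=float)

lower = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
upper = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
print("ACE bounds:", lower.fun, -upper.fun)
```

The optima of this program should coincide with the closed-form bounds derived by Balke and Pearl (1997) for the same observed distribution.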

5. The Potential Outcome Framework

This section compares the structural theory presented in Sections 1–3 to the potential-outcome framework, usually associated with the names of Neyman (1923) and Rubin (1974), which takes the randomized experiment as its ruling paradigm and has appealed therefore to researchers who do not find that paradigm overly constraining. This framework is not a contender for a comprehensive theory of causation, for it is subsumed by the structural theory and excludes ordinary cause-effect relationships from its assumption vocabulary. We here explicate the logical foundation of the Neyman-Rubin framework, its formal subsumption by the structural causal model, and how it can benefit from the insights provided by the broader perspective of the structural theory.

The primitive object of analysis in the potential-outcome framework is the unit-based response variable, denoted \(Y_x(u)\), read: "the value that outcome Y would obtain in experimental unit u, had treatment X been x." Here, unit may stand for an individual patient, an experimental subject, or an agricultural plot. In Section 3.4 (Eq. (29)) we saw that this counterfactual entity has a natural interpretation in the SCM; it is the solution for Y in a modified system of equations, where unit is interpreted as a vector u of background factors that characterize an experimental unit. Each structural equation model thus carries a collection of assumptions about the behavior of hypothetical units, and these assumptions permit us to derive the counterfactual quantities of interest. In the potential-outcome framework, however, no equations are available for guidance and \(Y_x(u)\) is taken as primitive, that is, an undefined quantity in terms of which other quantities are defined; not a quantity that can be derived from the model. In this sense the structural interpretation of \(Y_x(u)\) given in (29) provides the formal basis for the potential-outcome approach; the formation of the submodel \(M_x\) explicates mathematically how the hypothetical condition "had X been x" is realized, and what the logical consequences are of such a condition.

13: These equivalence classes were later called "principal stratification" by Frangakis and Rubin (2002). Looser bounds were derived earlier by Robins (1989) and Manski (1990).

5.1 The “black-box” missing-data paradigm

The distinct characteristic of the potential-outcome approach is that, although investigators must think and communicate in terms of undefined, hypothetical quantities such as \(Y_x(u)\), the analysis itself is conducted almost entirely within the axiomatic framework of probability theory. This is accomplished by postulating a "super" probability function on both hypothetical and real events. If U is treated as a random variable then the value of the counterfactual \(Y_x(u)\) becomes a random variable as well, denoted as \(Y_x\). The potential-outcome analysis proceeds by treating the observed distribution \(P(x_1, \ldots, x_n)\) as the marginal distribution of an augmented probability function \(P^*\) defined over both observed and counterfactual variables. Queries about causal effects (written P(y|do(x)) in the structural analysis) are phrased as queries about the marginal distribution of the counterfactual variable of interest, written \(P^*(Y_x = y)\). The new hypothetical entities \(Y_x\) are treated as ordinary random variables; for example, they are assumed to obey the axioms of probability calculus, the laws of conditioning, and the axioms of conditional independence.

Naturally, these hypothetical entities are not entirely whimsical. They are assumed to be connected to observed variables via consistency constraints (Robins, 1986) such as

\[ X = x \implies Y_x = Y, \qquad (32) \]

which states that, for every u, if the actual value of X turns out to be x, then the value that Y would take on if 'X were x' is equal to the actual value of Y. For example, a person who chose treatment x and recovered would also have recovered if given treatment x by design. When X is binary, it is sometimes more convenient to write (32) as:

\[ Y = x\,Y_1 + (1 - x)\,Y_0. \]

Whether additional constraints should tie the observables to the unobservables is not a question that can be answered in the potential-outcome framework, for it lacks an underlying model to define its axioms.

The main conceptual difference between the two approaches is that, whereas the structural approach views the intervention do(x) as an operation that changes a distribution but keeps the variables the same, the potential-outcome approach views the variable Y under do(x) to be a different variable, \(Y_x\), loosely connected to Y through relations such as (32), but remaining unobserved whenever X ≠ x. The problem of inferring probabilistic properties of \(Y_x\) then becomes one of "missing data," for which estimation techniques have been developed in the statistical literature.

Pearl (2000a, Chapter 7) shows, using the structural interpretation of \(Y_x(u)\), that it is indeed legitimate to treat counterfactuals as jointly distributed random variables in all respects, that consistency constraints like (32) are automatically satisfied in the structural interpretation and, moreover, that investigators need not be concerned about any additional constraints except the following two:

\[ Y_{yz} = y \quad \text{for all } y, \text{ subsets } Z, \text{ and values } z \text{ for } Z, \qquad (33) \]
\[ X_z = x \Rightarrow Y_{xz} = Y_z \quad \text{for all } x, \text{ subsets } Z, \text{ and values } z \text{ for } Z. \qquad (34) \]

Equation (33) ensures that the intervention do(Y = y) results in the condition Y = y, regardless of concurrent interventions, say do(Z = z), that may be applied to variables other than Y. Equation (34) generalizes (32) to cases where Z is held fixed at z. (See Halpern (1998) for a proof of completeness.)

5.2 Problem formulation and the demystification of "ignorability"

The main drawback of this black-box approach surfaces in problem formulation, namely, the phase where a researcher begins to articulate the "science" or "causal assumptions" behind the problem of interest. Such knowledge, as we have seen in Section 1, must be articulated at the onset of every problem in causal analysis – causal conclusions are only as valid as the causal assumptions upon which they rest.

To communicate scientific knowledge, the potential-outcome analyst must express assumptions as constraints on \(P^*\), usually in the form of conditional independence assertions involving counterfactual variables. For instance, in our example of Fig. 5, to communicate the understanding that Z is randomized (hence independent of \(U_X\) and \(U_Y\)), the potential-outcome analyst would use the independence constraint \(Z \perp\!\!\!\perp \{Y_{z_1}, Y_{z_2}, \ldots, Y_{z_k}\}\).(14) To further formulate the understanding that Z does not affect Y directly, except through X, the analyst would write a so-called "exclusion restriction": \(Y_{xz} = Y_x\).

A collection of constraints of this type might sometimes be sufficient to permit a unique solution to the query of interest. For example, if one can plausibly assume that, in Fig. 4, a set Z of covariates satisfies the conditional independence

\[ Y_x \perp\!\!\!\perp X \mid Z \qquad (35) \]

(an assumption termed "conditional ignorability" by Rosenbaum and Rubin (1983)), then the causal effect \(P(y \mid do(x)) = P^*(Y_x = y)\) can readily be evaluated to yield

\[
\begin{aligned}
P^*(Y_x = y) &= \sum_z P^*(Y_x = y \mid z)\, P(z) \\
             &= \sum_z P^*(Y_x = y \mid x, z)\, P(z) && \text{(using (35))} \\
             &= \sum_z P^*(Y = y \mid x, z)\, P(z) && \text{(using (32))} \\
             &= \sum_z P(y \mid x, z)\, P(z). && (36)
\end{aligned}
\]

The last expression contains no counterfactual quantities (thus permitting us to drop the asterisk from \(P^*\)) and coincides precisely with the standard covariate-adjustment formula of Eq. (25).

We see that the assumption of conditional ignorability (35) qualifies Z as an admissible covariate for adjustment; it mirrors therefore the "back-door" criterion of Definition 3, which bases the admissibility of Z on an explicit causal structure encoded in the diagram.
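As a concrete illustration of Eq. (36) (equivalently, the adjustment formula of Eq. (25)), the sketch below estimates P(y|do(x)) from a finite sample by plugging empirical frequencies into \(\sum_z P(y \mid x, z) P(z)\). The data-generating model and all parameter values are illustrative assumptions, chosen only so that Z is an admissible covariate.

```python
# Minimal sketch of covariate adjustment (Eq. (36)/(25)): estimate
# P(Y=1|do(X=x)) by sum_z P(Y=1|x, z) P(z) from sampled data.
# The simulated model is an illustrative assumption: Z confounds X and Y
# and satisfies the back-door criterion relative to (X, Y).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

z = rng.binomial(1, 0.4, n)                      # admissible covariate
x = rng.binomial(1, 0.2 + 0.5 * z)               # treatment depends on Z
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)     # outcome depends on X and Z

def adjusted(y, x, z, x_val):
    """Plug-in estimate of P(Y=1|do(X=x_val)) = sum_z P(Y=1|x_val, z) P(z)."""
    total = 0.0
    for z_val in (0, 1):
        stratum = (z == z_val)
        p_z = stratum.mean()
        p_y_given_xz = y[stratum & (x == x_val)].mean()
        total += p_y_given_xz * p_z
    return total

naive = y[x == 1].mean() - y[x == 0].mean()         # confounded contrast
ace = adjusted(y, x, z, 1) - adjusted(y, x, z, 0)   # adjusted contrast
print("naive:", round(naive, 3), "adjusted:", round(ace, 3), "truth: 0.3")
```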

The derivation above may explain why the potential-outcome approach appeals to mathematical statisticians; instead of constructing new vocabulary (e.g., arrows), new operators (do(x)), and new logic for causal analysis, almost all mathematical operations in this framework are conducted within the safe confines of probability calculus. Save for an occasional application of rule (34) or (32), the analyst may forget that \(Y_x\) stands for a counterfactual quantity—it is treated as any other random variable, and the entire derivation follows the course of routine probability exercises.

This orthodoxy exacts a high cost: instead of bringing the theory to the problem, the problem must be reformulated to fit the theory; all background knowledge pertaining to a given problem must first be translated into the language of counterfactuals (e.g., ignorability conditions) before analysis can commence. This translation may in fact be the hardest part of the problem. The reader may appreciate this aspect by attempting to judge whether the assumption of conditional ignorability (35), the key to the derivation of (36), holds in any familiar situation, say in the experimental setup of Fig. 2(a). This assumption reads: "the value that Y would obtain had X been x is independent of X, given Z." Even the most experienced potential-outcome expert would be unable to discern whether any subset Z of covariates in Fig. 4 would satisfy this conditional independence condition.(15) Likewise, to derive Eq. (35) in the language of potential outcomes (see Pearl (2000a, p. 223)), one would need to convey the structure of the chain \(X \to W_3 \to Y\) using the cryptic expression

\[ W_{3x} \perp\!\!\!\perp \{Y_{w_3}, X\}, \]

read: "the value that \(W_3\) would obtain had X been x is independent of the value that Y would obtain had \(W_3\) been \(w_3\), jointly with the value of X." Such assumptions are cast in a language so far removed from ordinary understanding of scientific theories that, for all practical purposes, they cannot be comprehended or ascertained by ordinary mortals. As a result, researchers in the graph-less potential-outcome camp rarely use "conditional ignorability" (35) to guide the choice of covariates; they view this condition as a hoped-for miracle of nature rather than a target to be achieved by reasoned design.(16)

14: The notation \(Y \perp\!\!\!\perp X \mid Z\) stands for the conditional independence relationship \(P(Y = y, X = x \mid Z = z) = P(Y = y \mid Z = z)\, P(X = x \mid Z = z)\) (Dawid, 1979).

Replacing "ignorability" with a conceptually meaningful condition (i.e., back-door) in a graphical model permits researchers to understand what conditions covariates must fulfill before they eliminate bias, what to watch for and what to think about when covariates are selected, and what experiments we can do to test, at least partially, if we have the knowledge needed for covariate selection.

Aside from offering no guidance in covariate selection, formulating a problem in the potential-outcome language encounters three additional hurdles. When counterfactual variables are not viewed as byproducts of a deeper, process-based model, it is hard to ascertain whether all relevant judgments have been articulated, whether the judgments articulated are redundant, or whether those judgments are self-consistent. The need to express, defend, and manage formidable counterfactual relationships of this type explains the slow acceptance of causal analysis among health scientists and statisticians, and why most economists and social scientists continue to use structural equation models (Wooldridge, 2002, Stock and Watson, 2003, Heckman, 2008) instead of the potential-outcome alternatives advocated in Angrist, Imbens, and Rubin (1996), Holland (1988), and Sobel (1998, 2008).

15: Inquisitive readers are invited to guess whether \(X_z \perp\!\!\!\perp Z \mid Y\) holds in Fig. 2(a), then reflect on why causality is so slow in penetrating statistical education.

16: The opaqueness of counterfactual independencies explains why many researchers within the potential-outcome camp are unaware of the fact that adding a covariate to the analysis (e.g., Z3 in Fig. 4, Z in Fig. 5) may actually increase confounding bias in propensity-score matching. Paul Rosenbaum, for example, writes: "there is little or no reason to avoid adjustment for a true covariate, a variable describing subjects before treatment" (Rosenbaum, 2002, p. 76). Rubin (2009) goes as far as stating that refraining from conditioning on an available measurement is "nonscientific ad hockery" for it goes against the tenets of Bayesian philosophy (see Pearl (2009c,d) and Heckman and Navarro-Lozano (2004) for a discussion of this fallacy).

On the other hand, the algebraic machinery offered by the counterfactual notation, \(Y_x(u)\), once a problem is properly formalized, can be extremely powerful in refining assumptions (Angrist et al., 1996, Heckman and Vytlacil, 2005), deriving consistent estimands (Robins, 1986), bounding probabilities of necessary and sufficient causation (Tian and Pearl, 2000), and combining data from experimental and nonexperimental studies (Pearl, 2000a). The next subsection (5.3) presents a way of combining the best features of the two approaches. It is based on encoding causal assumptions in the language of diagrams, translating these assumptions into counterfactual notation, performing the mathematics in the algebraic language of counterfactuals (using (32), (33), and (34)) and, finally, interpreting the result in graphical terms or plain causal language. The mediation problem of Section 6.1 illustrates how such symbiosis clarifies the definition and identification of direct and indirect effects,(17) and how it overcomes difficulties that were deemed insurmountable in the exclusivist potential-outcome framework (Rubin, 2004, 2005).

5.3 Combining graphs and potential outcomes

The formulation of causal assumptions using graphs was discussed in Section 3. In this subsection we will systematize the translation of these assumptions from graphs to counterfactual notation.

Structural equation models embody causal information in both the equations and the probability function P(u) assigned to the exogenous variables; the former is encoded as missing arrows in the diagrams, the latter as missing (double-arrowed) dashed arcs. Each parent-child family \((PA_i, X_i)\) in a causal diagram G corresponds to an equation in the model M. Hence, missing arrows encode exclusion assumptions, that is, claims that manipulating variables that are excluded from an equation will not change the outcome of the hypothetical experiment described by that equation. Missing dashed arcs encode independencies among error terms in two or more equations. For example, the absence of dashed arcs between a node Y and a set of nodes \(\{Z_1, \ldots, Z_k\}\) implies that the corresponding background variables, \(U_Y\) and \(\{U_{Z_1}, \ldots, U_{Z_k}\}\), are independent in P(u).

17: Such symbiosis is now standard in epidemiology research (Robins, 2001, Petersen, Sinisi, and van der Laan, 2006, VanderWeele and Robins, 2007, Hafeman and Schwartz, 2009, VanderWeele, 2009) yet still lacking in econometrics (Heckman, 2008, Imbens and Wooldridge, 2009).

These assumptions can be translated into the potential-outcome notation using two simple rules (Pearl, 2000a, p. 232); the first interprets the missing arrows in the graph, the second, the missing dashed arcs.

  1. Exclusion restrictions: For every variable Y having parents \(PA_Y\), and for every set of endogenous variables S disjoint of \(PA_Y\), we have

\[ Y_{pa_Y} = Y_{pa_Y, s}. \qquad (37) \]

  2. Independence restrictions: If \(Z_1, \ldots, Z_k\) is any set of nodes not connected to Y via dashed arcs, and \(PA_1, \ldots, PA_k\) their respective sets of parents, we have

\[ Y_{pa_Y} \perp\!\!\!\perp \{Z_{1\,pa_1}, \ldots, Z_{k\,pa_k}\}. \qquad (38) \]

The exclusion restrictions express the fact that each parent set includes all direct causes of the child variable; hence, fixing the parents of Y determines the value of Y uniquely, and intervention on any other set S of (endogenous) variables can no longer affect Y. The independence restriction translates the independence between \(U_Y\) and \(\{U_{Z_1}, \ldots, U_{Z_k}\}\) into independence between the corresponding potential-outcome variables. This follows from the observation that, once we set their parents, the variables in \(\{Y, Z_1, \ldots, Z_k\}\) stand in functional relationships to the U terms in their corresponding equations.

As an example, consider the model shown in Fig. 5, which serves as the canonical representation for the analysis of instrumental variables (Angrist et al., 1996, Balke and Pearl, 1997). This model displays the following parent sets:

\[ PA_Z = \emptyset, \quad PA_X = \{Z\}, \quad PA_Y = \{X\}. \qquad (39) \]

Consequently, the exclusion restrictions translate into:

\[ X_z = X_{yz}, \quad Z_y = Z_{xy} = Z_x = Z, \quad Y_x = Y_{xz}. \qquad (40) \]

The absence of any dashed arc between Z and {Y, X} translates into the independence restriction

\[ Z \perp\!\!\!\perp \{Y_x, X_z\}. \qquad (41) \]

This is precisely the condition of randomization; Z is independent of all its nondescendants, namely independent of \(U_X\) and \(U_Y\), which are the exogenous parents of X and Y, respectively. (Recall that the exogenous parents of any variable, say Y, may be replaced by the counterfactual variable \(Y_{pa_Y}\), because holding \(PA_Y\) constant renders Y a deterministic function of its exogenous parent \(U_Y\).)

The role of graphs is not ended with the formulation of causal assumptions. Throughout an algebraic derivation, like the one shown in Eq. (36), the analyst may need to employ additional assumptions that are entailed by the original exclusion and independence assumptions, yet are not shown explicitly in their respective algebraic expressions. For example, it is hardly straightforward to show that the assumptions of Eqs. (40)–(41) imply the conditional independence \((Y_x \perp\!\!\!\perp Z \mid \{X_z, X\})\) but do not imply the conditional independence \((Y_x \perp\!\!\!\perp Z \mid X)\). These are not easily derived by algebraic means alone. Such implications can, however, easily be tested in the graph of Fig. 5 using the graphical reading for conditional independence (Definition 1). (See Pearl (2000a, pp. 16–17, 213–215).) Thus, when the need arises to employ independencies in the course of a derivation, the graph may assist the procedure by vividly displaying the independencies that logically follow from our assumptions.

6. Counterfactuals at Work

6.1 Mediation: Direct and indirect effects

6.1.1 Direct versus total effects

The causal effect we have analyzed so far, P(y|do(x)), measures the total effect of a variable (or a set of variables) X on a response variable Y. In many cases, this quantity does not adequately represent the target of investigation and attention is focused instead on the direct effect of X on Y. The term "direct effect" is meant to quantify an effect that is not mediated by other variables in the model or, more accurately, the sensitivity of Y to changes in X while all other factors in the analysis are held fixed. Naturally, holding those factors fixed would sever all causal paths from X to Y with the exception of the direct link X → Y, which is not intercepted by any intermediaries.

A classical example of the ubiquity of direct effects involves legal disputes over race or sex discrimination in hiring. Here, neither the effect of sex or race on applicants' qualification nor the effect of qualification on hiring are targets of litigation. Rather, defendants must prove that sex and race do not directly influence hiring decisions, whatever indirect effects they might have on hiring by way of applicant qualification.

From a policy-making viewpoint, an investigator may be interested in decomposing effects to quantify the extent to which racial salary disparity is due to educational disparity, or, taking a health-care example, the extent to which sensitivity to a given exposure can be reduced by eliminating sensitivity to an intermediate factor, standing between exposure and outcome. Another example concerns the identification of neural pathways in the brain or the structural features of protein-signaling networks in molecular biology (Brent and Lok, 2005). Here, the decomposition of effects into their direct and indirect components carries theoretical scientific importance, for it tells us "how nature works" and, therefore, enables us to predict behavior under a rich variety of conditions.

Yet despite its ubiquity, the analysis of mediation has long been a thorny issue in the social and behavioral sciences (Judd and Kenny, 1981, Baron and Kenny, 1986, Muller, Judd, and Yzerbyt, 2005, Shrout and Bolger, 2002, MacKinnon, Fairchild, and Fritz, 2007a), primarily because structural equation modeling in those sciences was deeply entrenched in linear analysis, where the distinction between causal parameters and their regressional interpretations can easily be conflated.(18)

As demands grew to tackle problems involving binary and categorical variables, researchers could no longer define direct and indirect effects in terms of structural or regressional coefficients, and all attempts to extend the linear paradigms of effect decomposition to non-linear systems produced distorted results (MacKinnon, Lockwood, Brown, Wang, and Hoffman, 2007b). These difficulties have accentuated the need to redefine and derive causal effects from first principles, uncommitted to distributional assumptions or a particular parametric form of the equations.

The structural methodology presented in this paper adheres to this philosophy and it has indeed produced a principled solution to the mediation problem, based on the counterfactual reading of structural equations (29). The following subsections summarize the method and its solution.

6.1.2 Controlled direct-effects

A major impediment to progress in mediation analysis has been the lack of notational facility for expressing the key notion of "holding the mediating variables fixed" in the definition of direct effect. Clearly, this notion must be interpreted as (hypothetically) setting the intermediate variables to constants by physical intervention, not by analytical means such as selection, regression, conditioning, matching or adjustment. For example, consider the simple mediation models of Fig. 6, where the error terms (not shown explicitly) are assumed to be independent. It will not be sufficient to measure the association between gender (X) and hiring (Y) for a given level of qualification (Z) (see Fig. 6(b)), because, by conditioning on the mediator Z, we create spurious associations between X and Y through \(W_2\), even when there is no direct effect of X on Y (Pearl, 1998, Cole and Hernán, 2002).

Figure 6: (a) A generic model depicting mediation through Z with no confounders, and (b) with two confounders, \(W_1\) and \(W_2\).

18: All articles cited above define the direct and indirect effects through their regressional interpretations; I am not aware of any article in this tradition that formally adapts a causal interpretation, free of estimation-specific parameterization.

Using the do(x) notation enables us to correctly express the notion of "holding Z fixed" and obtain a simple definition of the controlled direct effect of the transition from X = x to X = x′:

\[ CDE \triangleq E(Y \mid do(x), do(z)) - E(Y \mid do(x'), do(z)) \]

or, equivalently, using counterfactual notation:

\[ CDE \triangleq E(Y_{xz}) - E(Y_{x'z}), \]

where Z is the set of all mediating variables. The reader can easily verify that, in linear systems, the controlled direct effect reduces to the path coefficient of the link X → Y (see footnote 12), regardless of whether confounders are present (as in Fig. 6(b)) and regardless of whether the error terms are correlated or not.
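As a quick symbolic check of this claim, the sketch below evaluates \(E(Y_{xz}) - E(Y_{x'z})\) for a linear version of Fig. 6(b); the symbol names and the coefficient on the confounder \(W_2\) are illustrative assumptions of mine, not notation from the paper.

```python
# Symbolic check: in a linear model the controlled direct effect
# E(Y_xz) - E(Y_x'z) equals the X -> Y path coefficient times (x - x'),
# irrespective of confounders.  Symbol names are illustrative assumptions.
import sympy as sp

x, xp, z = sp.symbols("x x_prime z")
c_x, c_z, a_w = sp.symbols("c_x c_z a_w")      # structural coefficients
EU_Y, EW2 = sp.symbols("EU_Y EW2")             # means of U_Y and W2

# Structural equation for Y in Fig. 6(b): Y = c_x*X + c_z*Z + a_w*W2 + U_Y.
# Under do(X=x), do(Z=z), the distributions of W2 and U_Y are unchanged, so:
E_Y_xz  = c_x * x  + c_z * z + a_w * EW2 + EU_Y   # E(Y_{xz})
E_Y_xpz = c_x * xp + c_z * z + a_w * EW2 + EU_Y   # E(Y_{x'z})

CDE = sp.factor(E_Y_xz - E_Y_xpz)
print(CDE)   # prints c_x*(x - x_prime): only the X -> Y path coefficient survives
```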

This separates the task of definition from that of identification, as demanded by Section 4.1. The identification of CDE would depend, of course, on whether confounders are present and whether they can be neutralized by adjustment, but these do not alter its definition. Nor should trepidation about the infeasibility of the action do(gender = male) enter the definitional phase of the study. Definitions apply to symbolic models, not to human biology. Graphical identification conditions for expressions of the type \(E(Y \mid do(x), do(z_1), do(z_2), \ldots, do(z_k))\) in the presence of unmeasured confounders were derived by Pearl and Robins (1995) (see Pearl (2000a, Chapter 4)) and invoke sequential application of the back-door conditions discussed in Section 3.2.

6.1.3 Natural direct effects

In linear systems, the direct effect is fully specified by the path coefficient attached to the link from X to Y; therefore, the direct effect is independent of the values at which we hold Z. In nonlinear systems, those values would, in general, modify the effect of X on Y and thus should be chosen carefully to represent the target policy under analysis. For example, it is not uncommon to find employers who prefer males for the high-paying jobs (i.e., high z) and females for low-paying jobs (low z).

When the direct effect is sensitive to the levels at which we hold Z, it is often more meaningful to define the direct effect relative to some "natural" base-line level that may vary from individual to individual, and represents the level of Z just before the change in X. Conceptually, we can define the natural direct effect \(DE_{x,x'}(Y)\) as the expected change in Y induced by changing X from x to x′ while keeping all mediating factors constant at whatever value they would have obtained under do(x).

This hypothetical change, which Robins and Greenland (1992) conceived and called "pure" and Pearl (2001) formalized and analyzed under the rubric "natural," mirrors what lawmakers instruct us to consider in race or sex discrimination cases: "The central question in any employment-discrimination case is whether the employer would have taken the same action had the employee been of a different race (age, sex, religion, national origin etc.) and everything else had been the same." (In Carson versus Bethlehem Steel Corp., 70 FEP Cases 921, 7th Cir. (1996)).

Extending the subscript notation to express nested counterfactuals, Pearl (2001) gave a formal definition for the "natural direct effect":

\[ DE_{x,x'}(Y) = E(Y_{x',Z_x}) - E(Y_x). \qquad (42) \]

Here, \(Y_{x',Z_x}\) represents the value that Y would attain under the operation of setting X to x′ and, simultaneously, setting Z to whatever value it would have obtained under the setting X = x. We see that \(DE_{x,x'}(Y)\), the natural direct effect of the transition from x to x′, involves probabilities of nested counterfactuals and cannot be written in terms of the do(x) operator. Therefore, the natural direct effect cannot in general be identified, even with the help of ideal, controlled experiments (see footnote 8 for an intuitive explanation). However, aided by the surgical definition of Eq. (29) and the notational power of nested counterfactuals, Pearl (2001) was nevertheless able to show that, if certain assumptions of "no confounding" are deemed valid, the natural direct effect can be reduced to

\[ DE_{x,x'}(Y) = \sum_z \left[ E(Y \mid do(x', z)) - E(Y \mid do(x, z)) \right] P(z \mid do(x)). \qquad (43) \]

The intuition is simple; the natural direct effect is the weighted average of the controlled direct effect, using the causal effect P(z|do(x)) as a weighting function.

One condition for the validity of (43) is that \(Z_x \perp\!\!\!\perp Y_{x',z} \mid W\) holds for some set W of measured covariates. This technical condition in itself, like the ignorability condition of (35), is close to meaningless for most investigators, as it is not phrased in terms of realized variables. The surgical interpretation of counterfactuals (29) can be invoked at this point to unveil the graphical interpretation of this condition.

It states that W should be admissible (i.e., satisfy the back-door condition) relative to the path(s) from Z to Y. This condition, satisfied by \(W_2\) in Fig. 6(b), is readily comprehended by empirical researchers, and the task of selecting such measurements, W, can then be guided by the available scientific knowledge. Additional graphical and counterfactual conditions for identification are derived in Pearl (2001), Petersen et al. (2006), and Imai, Keele, and Yamamoto (2008).

In particular, it can be shown (Pearl, 2001) that expression (43) is both valid and identifiable in Markovian models (i.e., no unobserved confounders), where each term on the right can be reduced to a "do-free" expression using Eq. (24) or (25) and then estimated by regression.

For example, for the model in Fig. 6(b), Eq. (43) reads:

\[ DE_{x,x'}(Y) = \sum_z \sum_{w_1} P(w_1) \left[ E(Y \mid x', z, w_1) - E(Y \mid x, z, w_1) \right] \sum_{w_2} P(z \mid x, w_2)\, P(w_2), \qquad (44) \]

while for the confounding-free model of Fig. 6(a) we have:

\[ DE_{x,x'}(Y) = \sum_z \left[ E(Y \mid x', z) - E(Y \mid x, z) \right] P(z \mid x). \qquad (45) \]

Both (44) and (45) can easily be estimated by a two-step regression.
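To make Eq. (45) concrete, the following minimal sketch evaluates the natural direct effect for binary X and Z in the confounding-free model of Fig. 6(a); the conditional quantities P(z|x) and E(Y|x,z) are invented for illustration and would in practice be estimated by regression.

```python
# Minimal numerical sketch of Eq. (45): natural direct effect in the
# confounding-free mediation model of Fig. 6(a), binary X and Z.
# The conditional quantities below are illustrative assumptions.

P_z_given_x = {0: {0: 0.7, 1: 0.3},    # P(Z=z | X=0)
               1: {0: 0.4, 1: 0.6}}    # P(Z=z | X=1)

E_y_given_xz = {(0, 0): 0.20, (0, 1): 0.50,   # E(Y | X=x, Z=z)
                (1, 0): 0.35, (1, 1): 0.70}

def natural_direct_effect(x, x_prime):
    """DE_{x,x'}(Y) = sum_z [E(Y|x',z) - E(Y|x,z)] * P(z|x)   (Eq. (45))."""
    return sum((E_y_given_xz[(x_prime, z)] - E_y_given_xz[(x, z)])
               * P_z_given_x[x][z]
               for z in (0, 1))

print(natural_direct_effect(0, 1))   # direct effect of the transition X: 0 -> 1
```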

6.1.4 Natural indirect effects

Remarkably, the definition of the natural direct effect (42) can be turned around and provide an operational definition for the indirect effect – a concept shrouded in mystery and controversy, because it is impossible, using the do(x) operator, to disable the direct link from X to Y so as to let X influence Y solely via indirect paths.

The natural indirect effect, IE, of the transition from x to x′ is defined as the expected change in Y affected by holding X constant, at X = x, and changing Z to whatever value it would have attained had X been set to X = x′. Formally, this reads (Pearl, 2001):

\[ IE_{x,x'}(Y) \triangleq E(Y_{x,Z_{x'}}) - E(Y_x), \qquad (46) \]

which is almost identical to the direct effect (Eq. (42)) save for exchanging x and x′ in the first term.

Indeed, it can be shown that, in general, the total effect TE of a transition is equal to the difference between the direct effect of that transition and the indirect effect of the reverse transition. Formally,

\[ TE_{x,x'}(Y) \triangleq E(Y_{x'} - Y_x) = DE_{x,x'}(Y) - IE_{x',x}(Y). \qquad (47) \]

In linear systems, where reversal of transitions amounts to negating the signs of their effects, we have the standard additive formula

\[ TE_{x,x'}(Y) = DE_{x,x'}(Y) + IE_{x,x'}(Y). \qquad (48) \]

Since each term above is based on an independent operational definition, this equality constitutes a formal justification for the additive formula used routinely in linear systems.

Note that, although it cannot be expressed in do-notation, the indirect effect has clear policy-making implications. For example: in the hiring discrimination context, a policy maker may be interested in predicting the gender mix in the work force if gender bias is eliminated and all applicants are treated equally—say, the same way that males are currently treated. This quantity will be given by the indirect effect of gender on hiring, mediated by factors such as education and aptitude, which may be gender-dependent.

More generally, a policy maker may be interested in the effect of issuing a directive to a select set of subordinate employees, or in carefully controlling the routing of messages in a network of interacting agents. Such applications motivate the analysis of path-specific effects, that is, the effect of X on Y through a selected set of paths (Avin, Shpitser, and Pearl, 2005).

In all these cases, the policy intervention invokes the selection of signals to be sensed, rather than variables to be fixed. Pearl (2001) has suggested therefore that signal sensing is more fundamental to the notion of causation than manipulation, the latter being but a crude way of stimulating the former in experimental setups. The mantra "No causation without manipulation" must be rejected. (See Pearl (2009b, Section 11.4.5).)

It is remarkable that counterfactual quantities like DE and IE, which cannot be expressed in terms of do(x) operators and appear therefore void of empirical content, can, under certain conditions, be estimated from empirical studies and serve to guide policies. Awareness of this potential should embolden researchers to go through the definitional step of the study and freely articulate the target quantity Q(M) in the language of science, i.e., counterfactuals, despite the seemingly speculative nature of each assumption in the model (Pearl, 2000b).

6.2 The Mediation Formula: a simple solution to a thorny problem

This subsection demonstrates how the solution provided in equations (45) and (48) can be applied to practical problems of assessing mediation effects in non-linear models. We will use the simple mediation model of Fig. 6(a), where all error terms (not shown explicitly) are assumed to be mutually independent, with the understanding that adjustment for appropriate sets of covariates W may be necessary to achieve this independence and that integrals should replace summations when dealing with continuous variables (Imai et al., 2008).

Combining (45) and (48), the expression for the indirect effect, IE, becomes:

\[ IE_{x,x'}(Y) = \sum_z E(Y \mid x, z) \left[ P(z \mid x') - P(z \mid x) \right], \qquad (49) \]

which provides a general formula for mediation effects, applicable to any nonlinear system, any distribution (of U), and any type of variables. Moreover, the formula is readily estimable by regression. Owing to its generality and ubiquity, I will refer to this expression as the "Mediation Formula."

The Mediation Formula represents the average increase in the outcome Y that the transition from X = x to X = x′ is expected to produce absent any direct effect of X on Y. Though based on solid causal principles, it embodies no causal assumption other than the generic mediation structure of Fig. 6(a). When the outcome Y is binary (e.g., recovery, or hiring) the ratio (1 − IE/TE) represents the fraction of responding individuals who owe their response to direct paths, while (1 − DE/TE) represents the fraction who owe their response to Z-mediated paths.

The Mediation Formula tells us that IE depends only on the expectation of the counterfactual \(Y_{xz}\), not on its functional form \(f_Y(x, z, u_Y)\) or its distribution \(P(Y_{xz} = y)\). It calls therefore for a two-step regression which, in principle, can be performed non-parametrically. In the first step we regress Y on X and Z, and obtain the estimate

\[ g(x, z) = E(Y \mid x, z) \]

for every (x, z) cell. In the second step we estimate the expectation of g(x, z) conditional on X = x′ and X = x, respectively, and take the difference:

\[ IE_{x,x'}(Y) = E_z\big(g(x, z) \mid x'\big) - E_z\big(g(x, z) \mid x\big). \]

Nonparametric estimation is not always practical. When Z consists of a vector of several mediators, the dimensionality of the problem would prohibit the estimation of E(Y|x,z) for every (x, z) cell, and the need arises to use parametric approximation. We can then choose any convenient parametric form for E(Y|x,z) (e.g., linear, logit, probit), estimate the parameters separately (e.g., by regression or maximum likelihood methods), insert the parametric approximation into (49), and estimate its two conditional expectations (over z) to get the mediated effect (VanderWeele, 2009).
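The two-step procedure can be sketched as follows for binary X and a discrete mediator Z; the simulated data-generating process is an illustrative assumption (mirroring Fig. 6(a)), and the estimator simply plugs empirical cell frequencies into Eq. (49).

```python
# Minimal sketch of the nonparametric two-step estimate of IE_{x,x'}(Y)
# via the Mediation Formula (49), for binary X and discrete Z.
# The simulated model is an illustrative assumption (no confounding).
import numpy as np

rng = np.random.default_rng(1)
n = 300_000
x = rng.binomial(1, 0.5, n)                      # exposure
z = rng.binomial(1, 0.3 + 0.4 * x)               # mediator depends on X
y = rng.binomial(1, 0.1 + 0.2 * x + 0.3 * z)     # outcome depends on X and Z

def indirect_effect(x_from, x_to):
    """IE = sum_z E(Y | x_from, z) * [P(z | x_to) - P(z | x_from)]   (Eq. (49))."""
    ie = 0.0
    for z_val in np.unique(z):
        g = y[(x == x_from) & (z == z_val)].mean()    # step 1: E(Y | x, z)
        p_to = (z[x == x_to] == z_val).mean()         # P(z | x')
        p_from = (z[x == x_from] == z_val).mean()     # P(z | x)
        ie += g * (p_to - p_from)                     # step 2: average over z
    return ie

print(indirect_effect(0, 1))   # under this toy model the target value is 0.3*0.4 = 0.12
```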

Let us examine what the Mediation Formula yields when applied to both linear and non-linear versions of model 6(a). In the linear case, the structural model reads:

\[ x = u_X, \qquad z = b_x x + u_Z, \qquad y = c_x x + c_z z + u_Y. \qquad (50) \]

Computing the conditional expectation in (49) gives

\[ E(Y \mid x, z) = E(c_x x + c_z z + u_Y) = c_x x + c_z z \]

and yields

\[
\begin{aligned}
IE_{x,x'}(Y) &= \sum_z (c_x x + c_z z)\left[ P(z \mid x') - P(z \mid x) \right] \\
             &= c_z \left[ E(Z \mid x') - E(Z \mid x) \right] && (51) \\
             &= (x' - x)(c_z b_x) && (52) \\
             &= (x' - x)(b - c_x), && (53)
\end{aligned}
\]

where b is the total effect coefficient, \(b = (E(Y \mid x') - E(Y \mid x))/(x' - x) = c_x + c_z b_x\).

We thus obtained the standard expressions for indirect effects in linear systems, which can be estimated either as a difference in two regression coefficients (Eq. 53) or a product of two regression coefficients (Eq. 52), with Y regressed on both X and Z (see MacKinnon et al. (2007b)). These two strategies do not generalize to non-linear systems, as we shall see next.
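A brief numerical check of the equivalence between the product-of-coefficients strategy (Eq. 52) and the difference-of-coefficients strategy (Eq. 53) in the linear model (50); the coefficient values and noise terms are illustrative assumptions, and ordinary least squares is used for all fits.

```python
# Numerical check that, in the linear model (50), the product estimate c_z*b_x
# and the difference estimate b - c_x of the indirect effect coincide.
# Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
b_x, c_x, c_z = 0.8, 0.5, 0.4          # assumed structural coefficients

x = rng.normal(size=n)
z = b_x * x + rng.normal(size=n)
y = c_x * x + c_z * z + rng.normal(size=n)

def ols(target, *regressors):
    """Least-squares coefficients of target on the regressors (intercept dropped)."""
    X = np.column_stack([np.ones_like(target)] + list(regressors))
    return np.linalg.lstsq(X, target, rcond=None)[0][1:]

b_hat = ols(y, x)[0]                   # total-effect coefficient b
c_x_hat, c_z_hat = ols(y, x, z)        # direct and mediator coefficients
b_x_hat = ols(z, x)[0]

print("product   :", c_z_hat * b_x_hat)   # Eq. (52)
print("difference:", b_hat - c_x_hat)     # Eq. (53), numerically the same quantity
```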

Suppose we apply (49) to a non-linear process (Fig. 7) in which X, Y, and Z are binary variables, and Y and Z are given by the Boolean formulas

\[ y = \mathrm{AND}(x, e_x) \lor \mathrm{AND}(z, e_z), \qquad x, z, e_x, e_z = 0, 1, \]
\[ z = \mathrm{AND}(x, e_{xz}), \qquad z, e_{xz} = 0, 1. \]

Such disjunctive interaction would describe, for example, a disease Y that would be triggered either by X directly, if enabled by \(e_x\), or by Z, if enabled by \(e_z\). Let us further assume that \(e_x\), \(e_z\), and \(e_{xz}\) are three independent Bernoulli variables with probabilities \(p_x\), \(p_z\), and \(p_{xz}\), respectively.

x z xz e z x ( ~ p x z ) e z ( ~ p z ) Z AND AND AND OR Y X e ( ~ p ) x x Figure7: Stochasticnon-linearmodelofmediation. Allvariablesare binary.

As investigators, we are not aware, of course, of these underlying mechanisms; all we know is that X, Y, and Z are binary, that Z is hypothesized to be a mediator, and that the assumption of nonconfoundedness permits us to use the Mediation Formula (49) for estimating the Z-mediated effect of X on Y. Assume that our plan is to conduct a nonparametric estimation of the terms in (49) over a very large sample drawn from P(x, y, z); it is interesting to ask what the asymptotic value of the Mediation Formula would be, as a function of the model parameters \(p_x\), \(p_z\), and \(p_{xz}\).

From knowledge of the underlying mechanism, we have:

\[ P(Z = 1 \mid x) = p_{xz}\, x, \qquad x = 0, 1, \]
\[ P(Y = 1 \mid x, z) = p_x x + p_z z - p_x p_z x z, \qquad x, z = 0, 1. \]

Therefore,

\[
\begin{aligned}
E(Z \mid x)    &= p_{xz}\, x, \qquad x = 0, 1, \\
E(Y \mid x, z) &= x p_x + z p_z - x z p_x p_z, \qquad x, z = 0, 1, \\
E(Y \mid x)    &= \sum_z E(Y \mid x, z) P(z \mid x) \\
               &= x p_x + (p_z - x p_x p_z) E(Z \mid x) \\
               &= x (p_x + p_{xz} p_z - x p_x p_z p_{xz}), \qquad x = 0, 1.
\end{aligned}
\]

Taking x = 0, x′ = 1 and substituting these expressions in (45), (48), and (49) yields

\[
\begin{aligned}
IE(Y) &= p_z p_{xz}, && (54) \\
DE(Y) &= p_x, && (55) \\
TE(Y) &= p_z p_{xz} + p_x - p_x p_z p_{xz}. && (56)
\end{aligned}
\]

Two observations are worth noting. First, we see that, despite the non-linear interaction between the two causal paths, the parameters of one do not influence the causal effect mediated by the other. Second, the total effect is not the sum of the direct and indirect effects. Instead, we have

\[ TE = DE + IE - DE \cdot IE, \]

which means that a fraction DE·IE/TE of outcome cases triggered by the transition from X = 0 to X = 1 are triggered simultaneously, through both causal paths, and would have been triggered even if one of the paths was disabled.
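The closed-form results (54)–(56) can be checked by simulating the Boolean mechanism of Fig. 7 and evaluating the unit-level counterfactuals directly via Eq. (29); the parameter values below are illustrative assumptions, and the code is only a sketch.

```python
# Monte Carlo check of Eqs. (54)-(56) for the Boolean mediation model of Fig. 7.
# Parameter values p_x, p_z, p_xz are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
p_x, p_z, p_xz = 0.6, 0.5, 0.7

e_x = rng.random(n) < p_x
e_z = rng.random(n) < p_z
e_xz = rng.random(n) < p_xz

def z_of(x):            # z = AND(x, e_xz)
    return x & e_xz

def y_of(x, z):         # y = AND(x, e_x) OR AND(z, e_z)
    return (x & e_x) | (z & e_z)

x0 = np.zeros(n, dtype=bool)
x1 = np.ones(n, dtype=bool)

# Counterfactual quantities evaluated directly on the units (Eq. (29)):
TE = y_of(x1, z_of(x1)).mean() - y_of(x0, z_of(x0)).mean()
DE = y_of(x1, z_of(x0)).mean() - y_of(x0, z_of(x0)).mean()   # natural direct effect
IE = y_of(x0, z_of(x1)).mean() - y_of(x0, z_of(x0)).mean()   # natural indirect effect

print("IE:", IE, "theory:", p_z * p_xz)                            # Eq. (54)
print("DE:", DE, "theory:", p_x)                                   # Eq. (55)
print("TE:", TE, "theory:", p_z * p_xz + p_x - p_x * p_z * p_xz)   # Eq. (56)
```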

Now assume that we choose to approximate E(Y|x,z) by the linear expression

\[ g(x, z) = a_0 + a_1 x + a_2 z. \qquad (57) \]

After fitting the parameters \(a_0, a_1, a_2\) to the data (e.g., by OLS) and substituting in (49), one would obtain

\[
\begin{aligned}
IE_{x,x'}(Y) &= \sum_z (a_0 + a_1 x + a_2 z)\left[ P(z \mid x') - P(z \mid x) \right] \qquad (58) \\
             &= a_2 \left[ E(Z \mid x') - E(Z \mid x) \right],
\end{aligned}
\]

which holds whenever we use the approximation in (57), regardless of the underlying mechanism.

If the correct data-generating process were the linear model of (50), we would obtain the expected estimates \(a_2 = c_z\), \(E(z \mid x') - E(z \mid x) = b_x (x' - x)\), and \(IE_{x,x'}(Y) = b_x c_z (x' - x)\).

If however we were to apply the approximation in (57) to data generated by the nonlinear model of Fig. 7, a distorted solution would ensue; \(a_2\) would evaluate to

\[
\begin{aligned}
a_2 &= \sum_x \left[ E(Y \mid x, z = 1) - E(Y \mid x, z = 0) \right] P(x) \\
    &= P(x = 1)\left[ E(Y \mid x = 1, z = 1) - E(Y \mid x = 1, z = 0) \right] \\
    &= P(x = 1)\left[ (p_x + p_z - p_x p_z) - p_x \right] \\
    &= P(x = 1)\, p_z (1 - p_x),
\end{aligned}
\]

\(E(z \mid x') - E(z \mid x)\) would evaluate to \(p_{xz}(x' - x)\), and (58) would yield the approximation

\[ \hat{IE}_{x,x'}(Y) = a_2 \left[ E(Z \mid x') - E(Z \mid x) \right] = p_{xz}\, P(x = 1)\, p_z (1 - p_x). \qquad (59) \]

We see immediately that the result differs from the correct value \(p_z p_{xz}\) derived in (54). Whereas the approximate value depends on P(x = 1), the correct value shows no such dependence, and rightly so; no causal effect should depend on the probability of the causal variable.

Fortunately, the analysis permits us to examine under what conditions the distortion would be significant. Comparing (59) and (54) reveals that the approximate method always underestimates the indirect effect and that the distortion is minimal for high values of P(x = 1) and \((1 - p_x)\).

Had we chosen to include an interaction term in the approximation of E(Y|x,z), the correct result would obtain. To witness, writing

\[ E(Y \mid x, z) = a_0 + a_1 x + a_2 z + a_3 x z, \]

\(a_2\) would evaluate to \(p_z\), \(a_3\) to \(-p_x p_z\), and the correct result obtains through:

\[
\begin{aligned}
IE_{x,x'}(Y) &= \sum_z (a_0 + a_1 x + a_2 z + a_3 x z)\left[ P(z \mid x') - P(z \mid x) \right] \\
             &= (a_2 + a_3 x)\left[ E(Z \mid x') - E(Z \mid x) \right] \\
             &= (a_2 + a_3 x)\, p_{xz} (x' - x) \\
             &= (p_z - p_x p_z x)\, p_{xz} (x' - x).
\end{aligned}
\]

We see that, in addition to providing causally sound estimates for mediation effects, the Mediation Formula also enables researchers to evaluate analytically the effectiveness of various parametric specifications relative to any assumed model.

This type of analytical "sensitivity analysis" has been used extensively in statistics for parameter estimation, but could not be applied to mediation analysis, owing to the absence of an objective target quantity that captures the notion of indirect effect in both linear and non-linear systems, free of parametric assumptions. The Mediation Formula of Eq. (49) explicates this target quantity formally, and casts it in terms of estimable quantities.

The derivation of the Mediation Formula was facilitated by taking seriously the four steps of the structural methodology (Section 4) together with the graphical-counterfactual-structural symbiosis spawned by the surgical interpretation of counterfactuals (Eq. (29)).

In contrast, when the mediation problem is approached from an exclusivist potential-outcome viewpoint, void of the structural guidance of Eq. (29), counterintuitive definitions ensue, carrying the label "principal stratification" (Rubin, 2004, 2005), which are at variance with common understanding of direct and indirect effects. For example, the direct effect is definable only in units absent of indirect effects. This means that a grandfather would be deemed to have no direct effect on his grandson's behavior in families where he has had some effect on the father.

This precludes from the analysis all typical families, in which a father and a grandfather have simultaneous, complementary influences on children's upbringing. In linear systems, to take an even sharper example, the direct effect would be undefined whenever indirect paths exist from the cause to its effect. The emergence of such paradoxical conclusions underscores the wisdom, if not necessity, of a symbiotic analysis, in which the counterfactual notation \(Y_x(u)\) is governed by its structural definition, Eq. (29).(19)

6.3 Causes of effects and probabilities of causation

The likelihood that one event was the cause of another guides much of what we understand about the world (and how we act in it). For example, knowing whether it was the aspirin that cured my headache or the TV program I was watching would surely affect my future use of aspirin. Likewise, to take an example from common judicial standard, judgment in favor of a plaintiff should be made if and only if it is "more probable than not" that the damage would not have occurred but for the defendant's action (Robertson, 1997).

These two examples fall under the category of "causes of effects" because they concern situations in which we observe both the effect, Y = y, and the putative cause X = x, and we are asked to assess, counterfactually, whether the former would have occurred absent the latter.

We have remarked earlier (footnote 8) that counterfactual probabilities conditioned on the outcome cannot in general be identified from observational or even experimental studies. This does not mean, however, that such probabilities are useless or void of empirical content; the structural perspective may in fact guide us toward discovering the conditions under which they can be assessed from data, thus defining the empirical content of these counterfactuals.

Following the four-step process of the structural methodology – define, assume, identify, and estimate – our first step is to express the target quantity in counterfactual notation and verify that it is well defined, namely, that it can be computed unambiguously from any fully specified causal model.

In our case, this step is simple. Assuming binary events, with X = x and Y = y representing treatment and outcome, respectively, and X = x′, Y = y′ their negations, our target quantity can be formulated directly from the English sentence: “Find the probability that Y would be y′ had X been x′, given that, in reality, Y is actually y and X is x,” to give:

$$
\mathrm{PN}(x, y) = P(Y_{x'} = y' \mid X = x, Y = y). \tag{60}
$$

Footnote 19: Such symbiosis is now standard in epidemiology research (Robins, 2001; Petersen et al., 2006; VanderWeele and Robins, 2007; Hafeman and Schwartz, 2009; VanderWeele, 2009) and is making its way slowly toward the social and behavioral sciences.

This counterfactual quantity, which Robins and Greenland (1989b) named “probability of causation” and Pearl (2000a, p. 296) named “probability of necessity” (PN), to be distinguished from two other nuances of “causation,” is certainly computable from any fully specified structural model, i.e., one in which P(u) and all functional relationships are given. This follows from the fact that every structural model defines a joint distribution of counterfactuals, through Eq. (29).
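To see why PN is computable once the model is fully specified, the sketch below enumerates the exogenous variables of a small hypothetical binary SCM (the functions and probabilities are illustrative, not from the paper) and evaluates Eq. (60) directly as P(Y_{x′} = y′, X = x, Y = y) / P(X = x, Y = y):

```python
import itertools

# Hypothetical fully specified SCM (for illustration only):
#   exogenous U = (u1, u2), independent, P(u1=1)=0.6, P(u2=1)=0.3
#   structural equations:  X := u1,   Y := X or u2
p_u1, p_u2 = 0.6, 0.3

def p_u(u1, u2):
    return (p_u1 if u1 else 1 - p_u1) * (p_u2 if u2 else 1 - p_u2)

def f_x(u1): return u1
def f_y(x, u2): return int(x or u2)

# PN = P(Y_{x'} = y' | X = x, Y = y) with x = y = 1 and x' = y' = 0, Eq. (60)
num = den = 0.0
for u1, u2 in itertools.product([0, 1], repeat=2):
    w = p_u(u1, u2)
    x_fact = f_x(u1)                    # factual X(u)
    y_fact = f_y(x_fact, u2)            # factual Y(u)
    if x_fact == 1 and y_fact == 1:     # condition on the evidence X=1, Y=1
        den += w
        if f_y(0, u2) == 0:             # counterfactual Y_{x'=0}(u)
            num += w

print("PN =", num / den)                # 0.7 here, i.e., P(u2 = 0)
```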

Having written a formal expression for PN, Eq. (60), we can move on to the formulation and identification phases and ask what assumptions would permit us to identify PN from empirical studies, be they observational, experimental, or a combination thereof.

This problem was analyzed in Pearl (2000a, Chapter 9) and yielded the following results:

Theorem 4. If Y is monotonic relative to X, i.e., Y_1(u) ≥ Y_0(u), then PN is identifiable whenever the causal effect P(y|do(x)) is identifiable and, moreover,

$$
\mathrm{PN} = \frac{P(y|x) - P(y|x')}{P(y|x)} + \frac{P(y|x') - P(y|do(x'))}{P(x, y)}. \tag{61}
$$

The first term on the r.h.s. of (61) is the familiar excess risk ratio (ERR) that epidemiologists have been using as a surrogate for PN in court cases (Cole, 1997; Robins and Greenland, 1989b). The second term represents the correction needed to account for confounding bias, that is, for P(y|do(x′)) ≠ P(y|x′).
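Under the assumptions of Theorem 4, Eq. (61) is a one-line computation once its four ingredients have been estimated. The sketch below uses purely hypothetical numbers in which the observational and experimental estimates of P(y|x′) agree, so the correction term vanishes and PN reduces to the excess risk ratio:

```python
def pn_monotonic(p_y_x, p_y_xp, p_y_do_xp, p_xy):
    """PN under monotonicity, Eq. (61): excess risk ratio plus a
    correction for confounding, (P(y|x') - P(y|do(x'))) / P(x,y)."""
    err = (p_y_x - p_y_xp) / p_y_x
    return err + (p_y_xp - p_y_do_xp) / p_xy

# Hypothetical inputs: P(y|x)=0.05 and P(y|x')=0.01 from an observational
# study, P(y|do(x'))=0.01 from a randomized trial, and P(x,y)=0.025.
print(pn_monotonic(0.05, 0.01, 0.01, 0.025))   # 0.8 = ERR (no confounding)
```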

This suggests that monotonicity and unconfoundedness were tacitly assumed by the many authors who proposed or derived ERR as a measure for the “fraction of exposed cases that are attributable to the exposure” (Greenland, 1999).

Equation (61) thus provides a more refined measure of causation, which can be used in situations where the causal effect P(y|do(x)) can be estimated from either randomized trials or graph-assisted observational studies (e.g., through Theorem 3 or Eq. (25)). It can also be shown (Tian and Pearl, 2000) that the expression in (61) provides a lower bound for PN in the general, nonmonotonic case. (See also Robins and Greenland (1989a).) In particular, the tight upper and lower bounds on PN are given by:

$$
\max\left\{0,\; \frac{P(y) - P(y|do(x'))}{P(x, y)}\right\} \;\le\; \mathrm{PN} \;\le\; \min\left\{1,\; \frac{P(y'|do(x')) - P(x', y')}{P(x, y)}\right\}. \tag{62}
$$

It is worth noting that, in drug-related litigation, it is not uncommon to obtain data from both experimental and observational studies. The former is usually available at the manufacturer or the agency that approved the drug for distribution (e.g., FDA), while the latter is easy to obtain by random surveys of the population.

In such cases, the standard lower bound used by epidemiologists to establish legal responsibility, the excess risk ratio, can be improved substantially using the corrective term of Eq. (61). Likewise, the upper bound of Eq. (62) can be used to exonerate drug-makers from legal responsibility. Cai and Kuroki (2006) analyzed the statistical properties of PN.

Pearl (2000a, p. 302) shows that combining data from experimental and observational studies which, taken separately, may indicate no causal relations between X and Y, can nevertheless bring the lower bound of Eq. (62) to unity, thus implying causation with probability one.
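The arithmetic behind this observation is easy to reproduce. The sketch below evaluates the bounds of Eq. (62); the numbers are hypothetical (they are not Pearl’s example), chosen so that X and Y are observationally independent and the experiment shows no average effect, yet the combined data force PN = 1:

```python
def pn_bounds(p_y, p_y_do_xp, p_yp_do_xp, p_xy, p_xpyp):
    """Tight bounds on PN from Eq. (62); no monotonicity assumed."""
    lower = max(0.0, (p_y - p_y_do_xp) / p_xy)
    upper = min(1.0, (p_yp_do_xp - p_xpyp) / p_xy)
    return lower, upper

# Hypothetical combined data: observationally P(y|x) = P(y|x') = 0.5 with
# P(x) = 0.5, so P(y) = 0.5 and P(x,y) = P(x',y') = 0.25; experimentally
# P(y|do(x)) = P(y|do(x')) = 0.25.
print(pn_bounds(p_y=0.5, p_y_do_xp=0.25, p_yp_do_xp=0.75,
                p_xy=0.25, p_xpyp=0.25))       # (1.0, 1.0): PN = 1
```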

Such extreme results dispel all fears and trepidations concerning the empirical content of counterfactuals (Dawid, 2000; Pearl, 2000b). They demonstrate that a quantity, PN, which at first glance appears to be hypothetical, ill-defined, untestable and, hence, unworthy of scientific analysis, is nevertheless definable, testable and, in certain cases, even identifiable. Moreover, the fact that, under certain combinations of data, and making no assumptions whatsoever, an important legal claim such as “the plaintiff would be alive had he not taken the drug” can be ascertained with probability approaching one is a remarkable tribute to formal analysis.

Another counterfactual quantity that has been fully characterized recently is the Effect of Treatment on the Treated (ETT):

$$
\mathrm{ETT} = P(Y_x = y \mid X = x').
$$

ETT has been used in econometrics to evaluate the effectiveness of social programs on their participants (Heckman, 1992) and has long been the target of research in epidemiology, where it came to be known as “the effect of exposure on the exposed,” or “standardized morbidity” (Miettinen, 1974; Greenland and Robins, 1986).

Shpitser and Pearl (2009) have derived a complete characterization of those models in which ETT can be identified from either experimental or observational studies. They have shown that, despite its blatantly counterfactual character (e.g., “I just took an aspirin; perhaps I shouldn’t have?”), ETT can be evaluated from experimental studies in many, though not all, cases. It can also be evaluated from observational studies whenever a sufficient set of covariates can be measured that satisfies the back-door criterion and, more generally, in a wide class of graphs that permit the identification of conditional interventions.
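As a concrete illustration of the observational route, the sketch below assumes a single measured covariate z that satisfies the back-door criterion, in which case E(Y_x | X = x′) can be estimated by Σ_z E(Y|x,z) P(z|x′); the data-generating process and all names are hypothetical, and this is only one special case of the general identification results cited above:

```python
import numpy as np
import pandas as pd

def ett(df, x=1, xp=0):
    """Sketch of E[Y_x | X = x'] via covariate adjustment, assuming the
    covariate 'z' satisfies the back-door criterion, so that
        E[Y_x | X = x'] = sum_z E[Y | x, z] P(z | x')."""
    e_y = df.groupby(["x", "z"])["y"].mean()                    # E(Y | x, z)
    p_z = df[df["x"] == xp]["z"].value_counts(normalize=True)   # P(z | x')
    return sum(e_y[(x, z)] * p_z[z] for z in p_z.index)

# Hypothetical data with a measured confounder z, for illustration only.
rng = np.random.default_rng(1)
z = rng.integers(0, 2, 100_000)
x = rng.binomial(1, 0.3 + 0.4 * z)               # treatment choice depends on z
y = rng.binomial(1, 0.1 + 0.2 * x + 0.3 * z)     # outcome depends on x and z
df = pd.DataFrame({"x": x, "z": z, "y": y})
print("E[Y_1 | X=0] ~", ett(df), "vs naive E[Y | X=0] =", df.loc[df.x == 0, "y"].mean())
```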

These results further illuminate the empirical content of counterfactuals and their essential role in causal analysis. They prove once again the triumph of logic and analysis over traditions that a priori exclude from the analysis quantities that are not testable in isolation. Most of all, they demonstrate the effectiveness and viability of the scientific approach to causation, whereby the dominant paradigm is to model the activities of Nature rather than those of the experimenter. In contrast to the ruling paradigm of conservative statistics, we begin with relationships that we know in advance will never be estimated, tested, or falsified. Only after assembling a host of such relationships and judging them to faithfully represent our theory about how Nature operates do we ask whether the parameter of interest, crisply defined in terms of those theoretical relationships, can be estimated consistently from empirical data, and how. It often can, to the credit of progressive statistics.

7. Conclusions

Traditional statistics is strong in devising ways of describing data and inferring distributional parameters from a sample. Causal inference requires two additional ingredients: a science-friendly language for articulating causal knowledge, and a mathematical machinery for processing that knowledge, combining it with data, and drawing new causal conclusions about a phenomenon. This paper surveys recent advances in causal analysis from the unifying perspective of the structural theory of causation and shows how statistical methods can be supplemented with the needed ingredients. The theory invokes non-parametric structural equation models as a formal and meaningful language for defining causal quantities, formulating causal assumptions, testing identifiability, and explicating many concepts used in causal discourse. These include: randomization, intervention, direct and indirect effects, confounding, counterfactuals, and attribution. The algebraic component of the structural language coincides with the potential-outcome framework, and its graphical component embraces Wright’s method of path diagrams. When unified and synthesized, the two components offer statistical investigators a powerful and comprehensive methodology for empirical research.

References

Angrist, J., G. Imbens, and D. Rubin (1996): “Identification of causal effects using instrumental variables (with comments),” Journal of the American Statistical Association, 91, 444–472.

Arah, O. (2008): “The role of causal reasoning in understanding Simpson’s paradox, Lord’s paradox, and the suppression effect: Covariate selection in the analysis of observational studies,” Emerging Themes in Epidemiology, 4, doi:10.1186/1742-7622-5-5, online at <http://www.ete-online.com/content/5/1/5>.

Arjas, E. and J. Parner (2004): “Causal reasoning from longitudinal data,” Scandinavian Journal of Statistics, 31, 171–187.

Avin, C., I. Shpitser, and J. Pearl (2005): “Identifiability of path-specific effects,” in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence IJCAI-05, Edinburgh, UK: Morgan-Kaufmann Publishers, 357–363.

Balke, A. and J. Pearl (1995): “Counterfactuals and policy analysis in structural models,” in P. Besnard and S. Hanks, eds., Uncertainty in Artificial Intelligence 11, San Francisco: Morgan Kaufmann, 11–18.

Balke, A. and J. Pearl (1997): “Bounds on treatment effects from studies with imperfect compliance,” Journal of the American Statistical Association, 92, 1172–1176.

Baron, R. and D. Kenny (1986): “The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations,” Journal of Personality and Social Psychology, 51, 1173–1182.

Berkson, J. (1946): “Limitations of the application of fourfold table analysis to hospital data,” Biometrics Bulletin, 2, 47–53.

Bollen, K. (1989): Structural Equations with Latent Variables, New York: John Wiley.

Brent, R. and L. Lok (2005): “A fishing buddy for hypothesis generators,” Science, 308, 523–529.

Cai, Z. and M. Kuroki (2006): “Variance estimators for three ‘probabilities of causation’,” Risk Analysis, 25, 1611–1620.

Cai, Z. and M. Kuroki (2008): “On identifying total effects in the presence of latent variables and selection bias,” in D. A. McAllester and P. Myllymäki, eds., Uncertainty in Artificial Intelligence, Proceedings of the Twenty-Fourth Conference, Arlington, VA: AUAI, 62–69.

Cartwright, N. (2007): Hunting Causes and Using Them: Approaches in Philosophy and Economics, New York, NY: Cambridge University Press.

Chalak, K. and H. White (2006): “An extended class of instrumental variables for the estimation of causal effects,” Technical Report Discussion Paper, UCSD, Department of Economics.

Chickering, D. and J. Pearl (1997): “A clinician’s tool for analyzing non-compliance,” Computing Science and Statistics, 29, 424–431.

Cole, P. (1997): “Causality in epidemiology, health policy, and law,” Journal of Marketing Research, 27, 10279–10285.

Cole, S. and M. Hernán (2002): “Fallibility in estimating direct effects,” International Journal of Epidemiology, 31, 163–165.

Cox, D. (1958): The Planning of Experiments, NY: John Wiley and Sons.

Cox, D. and N. Wermuth (2004): “Causality: A statistical view,” International Statistical Review, 72, 285–305.

Dawid, A. (1979): “Conditional independence in statistical theory,” Journal of the Royal Statistical Society, Series B, 41, 1–31.

Dawid, A. (2000): “Causal inference without counterfactuals (with comments and rejoinder),” Journal of the American Statistical Association, 95, 407–448.

Dawid, A. (2002): “Influence diagrams for causal modelling and inference,” International Statistical Review, 70, 161–189.

Duncan, O. (1975): Introduction to Structural Equation Models, New York: Academic Press.

Eells, E. (1991): Probabilistic Causality, Cambridge, MA: Cambridge University Press.

Frangakis, C. and D. Rubin (2002): “Principal stratification in causal inference,” Biometrics, 58, 21–29.

Glymour, M. and S. Greenland (2008): “Causal diagrams,” in K. Rothman, S. Greenland, and T. Lash, eds., Modern Epidemiology, Philadelphia, PA: Lippincott Williams & Wilkins, 3rd edition, 183–209.

Goldberger, A. (1972): “Structural equation models in the social sciences,” Econometrica: Journal of the Econometric Society, 40, 979–1001.

Goldberger, A. (1973): “Structural equation models: An overview,” in A. Goldberger and O. Duncan, eds., Structural Equation Models in the Social Sciences, New York, NY: Seminar Press, 1–18.

Greenland, S. (1999): “Relation of probability of causation, relative risk, and doubling dose: A methodologic error that has become a social problem,” American Journal of Public Health, 89, 1166–1169.

Greenland, S., J. Pearl, and J. Robins (1999): “Causal diagrams for epidemiologic research,” Epidemiology, 10, 37–48.

Greenland, S. and J. Robins (1986): “Identifiability, exchangeability, and epidemiological confounding,” International Journal of Epidemiology, 15, 413–419.

Haavelmo, T. (1943): “The statistical implications of a system of simultaneous equations,” Econometrica, 11, 1–12; reprinted in D.F. Hendry and M.S. Morgan (Eds.), The Foundations of Econometric Analysis, Cambridge University Press, 477–490, 1995.

Hafeman, D. and S. Schwartz (2009): “Opening the black box: A motivation for the assessment of mediation,” International Journal of Epidemiology, 38, 838–845.

Halpern, J. (1998): “Axiomatizing causal reasoning,” in G. Cooper and S. Moral, eds., Uncertainty in Artificial Intelligence, San Francisco, CA: Morgan Kaufmann, 202–210; also, Journal of Artificial Intelligence Research, 12, 317–337, 2000.

Heckman, J. (1992): “Randomization and social policy evaluation,” in C. Manski and I. Garfinkle, eds., Evaluations: Welfare and Training Programs, Cambridge, MA: Harvard University Press, 201–230.

Heckman, J. (2005): “The scientific model of causality,” Sociological Methodology, 35, 1–97.

Heckman, J. (2008): “Econometric causality,” International Statistical Review, 76, 1–27.

Heckman, J. and S. Navarro-Lozano (2004): “Using matching, instrumental variables, and control functions to estimate economic choice models,” The Review of Economics and Statistics, 86, 30–57.

Heckman, J. and E. Vytlacil (2005): “Structural equations, treatment effects and econometric policy evaluation,” Econometrica, 73, 669–738.

Hernán, M. and S. Cole (2009): “Invited commentary: Causal diagrams and measurement bias,” American Journal of Epidemiology, 170, 959–962.

Holland, P. (1988): “Causal inference, path analysis, and recursive structural equations models,” in C. Clogg, ed., Sociological Methodology, Washington, D.C.: American Sociological Association, 449–484.

Hurwicz, L. (1950): “Generalization of the concept of identification,” in T. Koopmans, ed., Statistical Inference in Dynamic Economic Models, Cowles Commission, Monograph 10, New York: Wiley, 245–257.

Imai, K., L. Keele, and T. Yamamoto (2008): “Identification, inference, and sensitivity analysis for causal mediation effects,” Technical report, Department of Politics, Princeton University.

Imbens, G. and J. Wooldridge (2009): “Recent developments in the econometrics of program evaluation,” Journal of Economic Literature, 47, 5–86.

Judd, C. and D. Kenny (1981): “Process analysis: Estimating mediation in treatment evaluations,” Evaluation Review, 5, 602–619.

Kaufman, S., J. Kaufman, and R. MacLehose (2009): “Analytic bounds on causal risk differences in directed acyclic graphs involving three observed binary variables,” Journal of Statistical Planning and Inference, 139, 3473–3487.

Kiiveri, H., T. Speed, and J. Carlin (1984): “Recursive causal models,” Journal of Australian Math Society, 36, 30–52.

Koopmans, T. (1953): “Identification problems in econometric model construction,” in W. Hood and T. Koopmans, eds., Studies in Econometric Method, New York: Wiley, 27–48.

Kuroki, M. and M. Miyakawa (1999): “Identifiability criteria for causal effects of joint interventions,” Journal of the Royal Statistical Society, 29, 105–117.

Lauritzen, S. (1996): Graphical Models, Oxford: Clarendon Press.

Lauritzen, S. (2001): “Causal inference from graphical models,” in D. Cox and C. Kluppelberg, eds., Complex Stochastic Systems, Boca Raton, FL: Chapman and Hall/CRC Press, 63–107.

Lindley, D. (2002): “Seeing and doing: The concept of causation,” International Statistical Review, 70, 191–214.

MacKinnon, D., A. Fairchild, and M. Fritz (2007a): “Mediation analysis,” Annual Review of Psychology, 58, 593–614.

MacKinnon, D., C. Lockwood, C. Brown, W. Wang, and J. Hoffman (2007b): “The intermediate endpoint effect in logistic and probit regression,” Clinical Trials, 4, 499–513.

Manski, C. (1990): “Nonparametric bounds on treatment effects,” American Economic Review, Papers and Proceedings, 80, 319–323.

Marschak, J. (1950): “Statistical inference in economics,” in T. Koopmans, ed., Statistical Inference in Dynamic Economic Models, New York: Wiley, 1–50; Cowles Commission for Research in Economics, Monograph 10.

Meek, C. and C. Glymour (1994): “Conditioning and intervening,” British Journal for the Philosophy of Science, 45, 1001–1021.

Miettinen, O. (1974): “Proportion of disease caused or prevented by a given exposure, trait, or intervention,” American Journal of Epidemiology, 99, 325–332.

Morgan, S. and C. Winship (2007): Counterfactuals and Causal Inference: Methods and Principles for Social Research (Analytical Methods for Social Research), New York, NY: Cambridge University Press.

Muller, D., C. Judd, and V. Yzerbyt (2005): “When moderation is mediated and mediation is moderated,” Journal of Personality and Social Psychology, 89, 852–863.

Neyman, J. (1923): “On the application of probability theory to agricultural experiments. Essay on principles. Section 9,” Statistical Science, 5, 465–480.

Pearl, J. (1988): Probabilistic Reasoning in Intelligent Systems, San Mateo, CA: Morgan Kaufmann.

Pearl, J. (1993a): “Comment: Graphical models, causality, and intervention,” Statistical Science, 8, 266–269.

Pearl, J. (1993b): “Mediating instrumental variables,” Technical Report R-210, <http://ftp.cs.ucla.edu/pub/stat_ser/R210.pdf>, Department of Computer Science, University of California, Los Angeles.

Pearl, J. (1995): “Causal diagrams for empirical research,” Biometrika, 82, 669–710.

Pearl, J. (1998): “Graphs, causality, and structural equation models,” Sociological Methods and Research, 27, 226–284.

Pearl, J. (2000a): Causality: Models, Reasoning, and Inference, New York: Cambridge University Press; second ed., 2009.

Pearl, J. (2000b): “Comment on A.P. Dawid’s, Causal inference without counterfactuals,” Journal of the American Statistical Association, 95, 428–431.

Pearl, J. (2001): “Direct and indirect effects,” in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, San Francisco, CA: Morgan Kaufmann, 411–420.

Pearl, J. (2004): “Robustness of causal claims,” in M. Chickering and J. Halpern, eds., Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence, Arlington, VA: AUAI Press, 446–453.

Pearl, J. (2009a): “Causal inference in statistics: An overview,” Statistics Surveys, 3, 96–146, <http://www.i-journals.org/ss/viewarticle.php?id=57>.

Pearl, J. (2009b): Causality: Models, Reasoning, and Inference, New York: Cambridge University Press, second edition.

Pearl, J. (2009c): “Letter to the editor: Remarks on the method of propensity scores,” Statistics in Medicine, 28, 1415–1416, <http://ftp.cs.ucla.edu/pub/stat_ser/r345-sim.pdf>.

Pearl, J. (2009d): “Myth, confusion, and science in causal analysis,” Technical Report R-348, Department of Computer Science, University of California, Los Angeles, CA, <http://ftp.cs.ucla.edu/pub/stat_ser/r348.pdf>.

Pearl, J. (2009e): “On a class of bias-amplifying covariates that endanger effect estimates,” Technical Report R-346, Department of Computer Science, University of California, Los Angeles, CA, <http://ftp.cs.ucla.edu/pub/stat_ser/r346.pdf>.

Pearl, J. (2009f): “On measurement bias in causal inference,” Technical Report R-357, <http://ftp.cs.ucla.edu/pub/stat_ser/r357.pdf>, Department of Computer Science, University of California, Los Angeles.

Pearl, J. and A. Paz (2009): “Confounding equivalence in observational studies,” Technical Report R-343, Department of Computer Science, University of California, Los Angeles, CA, <http://ftp.cs.ucla.edu/pub/stat_ser/r343.pdf>.

Pearl, J. and J. Robins (1995): “Probabilistic evaluation of sequential plans from causal models with hidden variables,” in P. Besnard and S. Hanks, eds., Uncertainty in Artificial Intelligence 11, San Francisco: Morgan Kaufmann, 444–453.

Pearl, J. and T. Verma (1991): “A theory of inferred causation,” in J. Allen, R. Fikes, and E. Sandewall, eds., Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference, San Mateo, CA: Morgan Kaufmann, 441–452.

Petersen, M., S. Sinisi, and M. van der Laan (2006): “Estimation of direct causal effects,” Epidemiology, 17, 276–284.

Richard, J. (1980): “Models with several regimes and changes in exogeneity,” Review of Economic Studies, 47, 1–20.

Robertson, D. (1997): “The common sense of cause in fact,” Texas Law Review, 75, 1765–1800.

Robins, J. (1986): “A new approach to causal inference in mortality studies with a sustained exposure period – applications to control of the healthy worker survivor effect,” Mathematical Modeling, 7, 1393–1512.

Robins, J. (1987): “A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods,” Journal of Chronic Diseases, 40, 139S–161S.

Robins, J. (1989): “The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies,” in L. Sechrest, H. Freeman, and A. Mulley, eds., Health Service Research Methodology: A Focus on AIDS, Washington, D.C.: NCHSR, U.S. Public Health Service, 113–159.

Robins, J. (1999): “Testing and estimation of direct effects by reparameterizing directed acyclic graphs with structural nested models,” in C. Glymour and G. Cooper, eds., Computation, Causation, and Discovery, Cambridge, MA: AAAI/MIT Press, 349–405.

Robins, J. (2001): “Data, design, and background knowledge in etiologic inference,” Epidemiology, 12, 313–320.

Robins, J. and S. Greenland (1989a): “Estimability and estimation of excess and etiologic fractions,” Statistics in Medicine, 8, 845–859.

Robins, J. and S. Greenland (1989b): “The probability of causation under a stochastic model for individual risk,” Biometrics, 45, 1125–1138.

Robins, J. and S. Greenland (1992): “Identifiability and exchangeability for direct and indirect effects,” Epidemiology, 3, 143–155.

Rosenbaum, P. (2002): Observational Studies, New York: Springer-Verlag, second edition.

Rosenbaum, P. and D. Rubin (1983): “The central role of the propensity score in observational studies for causal effects,” Biometrika, 70, 41–55.

Rothman, K. (1976): “Causes,” American Journal of Epidemiology, 104, 587–592.

Rubin, D. (1974): “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology, 66, 688–701.

Rubin, D. (2004): “Direct and indirect causal effects via potential outcomes,” Scandinavian Journal of Statistics, 31, 161–170.

Rubin, D. (2005): “Causal inference using potential outcomes: Design, modeling, decisions,” Journal of the American Statistical Association, 100, 322–331.

Rubin, D. (2007): “The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials,” Statistics in Medicine, 26, 20–36.

Rubin, D. (2009): “Author’s reply: Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups?” Statistics in Medicine, 28, 1420–1423.

Shpitser, I. and J. Pearl (2006): “Identification of conditional interventional distributions,” in R. Dechter and T. Richardson, eds., Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, Corvallis, OR: AUAI Press, 437–444.

Shpitser, I. and J. Pearl (2008): “Dormant independence,” in Proceedings of the Twenty-Third Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press, 1081–1087.

Shpitser, I. and J. Pearl (2009): “Effects of treatment on the treated: Identification and generalization,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, Quebec: AUAI Press.

Shrier, I. (2009): “Letter to the editor: Propensity scores,” Statistics in Medicine, 28, 1317–1318; see also Pearl 2009 <http://ftp.cs.ucla.edu/pub/stat_ser/r348.pdf>.

Shrout, P. and N. Bolger (2002): “Mediation in experimental and nonexperimental studies: New procedures and recommendations,” Psychological Methods, 7, 422–445.

Simon, H. (1953): “Causal ordering and identifiability,” in W.C. Hood and T. Koopmans, eds., Studies in Econometric Method, New York, NY: Wiley and Sons, Inc., 49–74.

Simon, H. and N. Rescher (1966): “Cause and counterfactual,” Philosophy of Science, 33, 323–340.

Sobel, M. (1998): “Causal inference in statistical models of the process of socioeconomic achievement,” Sociological Methods & Research, 27, 318–348.

Sobel, M. (2008): “Identification of causal parameters in randomized studies with mediating variables,” Journal of Educational and Behavioral Statistics, 33, 230–231.

Spirtes, P., C. Glymour, and R. Scheines (1993): Causation, Prediction, and Search, New York: Springer-Verlag.

Spirtes, P., C. Glymour, and R. Scheines (2000): Causation, Prediction, and Search, Cambridge, MA: MIT Press, 2nd edition.

Stock, J. and M. Watson (2003): Introduction to Econometrics, New York: Addison Wesley.

Strotz, R. and H. Wold (1960): “Recursive versus nonrecursive systems: An attempt at synthesis,” Econometrica, 28, 417–427.

Suppes, P. (1970): A Probabilistic Theory of Causality, Amsterdam: North-Holland Publishing Co.

Tian, J., A. Paz, and J. Pearl (1998): “Finding minimal separating sets,” Technical Report R-254, University of California, Los Angeles, CA.

Tian, J. and J. Pearl (2000): “Probabilities of causation: Bounds and identification,” Annals of Mathematics and Artificial Intelligence, 28, 287–313.

Tian, J. and J. Pearl (2002): “A general identification condition for causal effects,” in Proceedings of the Eighteenth National Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press/The MIT Press, 567–573.

VanderWeele, T. (2009): “Marginal structural models for the estimation of direct and indirect effects,” Epidemiology, 20, 18–26.

VanderWeele, T. and J. Robins (2007): “Four types of effect modification: A classification based on directed acyclic graphs,” Epidemiology, 18, 561–568.

Verma, T. and J. Pearl (1990): “Equivalence and synthesis of causal models,” in Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, Cambridge, MA, 220–227; also in P. Bonissone, M. Henrion, L.N. Kanal and J.F. Lemmer (Eds.), Uncertainty in Artificial Intelligence 6, Elsevier Science Publishers, B.V., 255–268, 1991.

Wermuth, N. (1992): “On block-recursive regression equations,” Brazilian Journal of Probability and Statistics (with discussion), 6, 1–56.

Wermuth, N. and D. Cox (1993): “Linear dependencies represented by chain graphs,” Statistical Science, 8, 204–218.

Whittaker, J. (1990): Graphical Models in Applied Multivariate Statistics, Chichester, England: John Wiley.

Wilkinson, L., the Task Force on Statistical Inference, and APA Board of Scientific Affairs (1999): “Statistical methods in psychology journals: Guidelines and explanations,” American Psychologist, 54, 594–604.

Woodward, J. (2003): Making Things Happen, New York, NY: Oxford University Press.

Wooldridge, J. (2002): Econometric Analysis of Cross Section and Panel Data, Cambridge and London: MIT Press.

Wooldridge, J. (2009): “Should instrumental variables be used as matching variables?” Technical Report, <https://www.msu.edu/~ec/faculty/wooldridge/current%20research/treat1r6.pdf>, Michigan State University, MI.

Wright, S. (1921): “Correlation and causation,” Journal of Agricultural Research, 20, 557–585.

